 A lot of scientific research is propelled by gadgets, instruments, things like microscopes, telescopes, thermometers. These inventions seem pretty simple now, but when they were invented, they opened up whole new worlds of discovery, universes in drops of water, distant worlds with their own moons. Scientific instrumentation today is a bit more complicated. Things like this, the Large Hadron Collider in Europe, crack open atoms, tried to find new kind of physics to reconcile the problems in our existing theories of physics. But the data that comes out of something like the Large Hadron Collider is not really useful without parallel advances in statistical computing, in data analysis algorithms, and in the hardware that makes it possible to move all this data around. And together with that, we require some parallel cultural evolution in how to use the data processing and scientific computing responsibly so that it actually gives us answers and not just delusions. Here you see scientists of a previous era pipetting with their mouths, something that would be considered completely unsafe and even grounds for firing in many laboratories. Lots of statistical computing is unfortunately like pipetting with your mouth. For this lecture, I want to talk about the scientific computing part of drawing the Bayesian owl, this workflow from the first lecture. It's important to get the theory right in step one, have some clear communicated theoretical estimate. You need to have some way to translate scientific assumptions into structural causal models. And then you need to use logic to get that causal logic into statistical modeling. But in step five here, I think I've let you down so far in the sense that I haven't really said that much about the scientific computing part of data analysis for realistically complicated problems. So far we've done basic linear regression, and I've hidden a lot of the computational details from you. And to a large extent we can do most of our scientific work with the computational details in the black boxed away where the experts have safely secured it. But for this lecture I want to open that box a bit because there's a lot about the way that modern computing works, especially Bayesian computing, that requires you to know something about what's going on inside the golem, as it were. So we say we're just going to analyze the data. If your response is to be a bit sarcastic, that's fair. There's a lot that goes on in this little step five of analyzing the data. There are lots of different machines you can use to compute the posterior distribution or get some approximation rather of the posterior distribution. The first is this analytical approach, the classical approach. I've shown no examples of it in this course because it is completely unimportant, aside from understanding the logic of Bayesian analysis. But it's not useful for even simple problems. The second is grid approximation, which I showed you in the second lecture. Grid approximation is a great way to understand what Bayesian updating is about, as it reinforces that it's just counting, counting all the possibilities for every possible explanation of the data, and then showing the relative numbers of ways that each explanation is compatible with the data. But grid approximation is just too computationally intensive. As models get more complicated, that is the dimensionality of the problem grows, even the fastest computers in the world are not fast enough to solve your problems within your lifetime. And then the third we've been using up to this point is quadratic approximation, or Laplace approximation is sometimes called, and this is where we approximate the posterior distribution as a multivariate normal distribution. This is a really valuable technique. It'll continue to be valuable for a very long time, but it's limited in the sense that lots of problems are not multivariate Gaussian. And we're going to start to use problems of that kind, or encounter problems of that kind in this course now, and so we need some other option. So this lecture is about number four, Markov chain Monte Carlo. Markov chain Monte Carlo is intensive, but it's not very intensive. It's intensive in the sense that models that would take a second to fit with quadratic approximation might take a minute, or half a minute to a minute with Markov chain Monte Carlo. Maybe that doesn't sound like a bad ratio to you, but when you get into things that take 10 minutes with quadratic approximation, they could take hours or days with Markov chain Monte Carlo. So why would we ever use it? Well, the answer is because it's much more capable. It has many fewer limitations, and in fact it has become essential in many areas of science because it makes it possible to fit models to data that essentially nobody could fit to data before. Before we get into the details of it though, let's think sort of as an allegory, and I'll walk you in a narrative way into what Markov chain Monte Carlo is. Imagine an oceanic king, a benign monarch named King Markov pictured here, and King Markov is the benign ruler of an archipelago, the metropolis archipelago, and this archipelago has a number of islands in it, and the islands are different sizes, and so they have different populations on them, different sizes of populations on them. And as a benign monarch, King Markov has this contract with his people that he will visit them, that his people love him, so he agrees to visit them. He will visit them in proportion to the population sizes of the islands they're on. And his advisors have gotten together and figured out a way that he can do this without any kind of calendar schedule because King Markov likes to live free. He's a bit of a bohemian. So here's how he does it. So he does a visit on an island, and then at the end of that week of his visit, he flips a coin to choose the island to the left or the right of the current island he's on, and he calls that island, whichever the coin chooses, he calls that the proposal island, and there's a half chance of proposing the left or the right. So suppose the proposal ends up being the right in this case. Then he will ask someone, hey, that island over there, what's its population size? So he gets an estimate of the population size of the proposal island, let's call that piece of five. And he compares that to the population of the current island, which we'll call piece of four. And then he moves to the proposal island with a probability equal to P5 over P4. Now this number can be greater than one, in that case he always moves, but if it's less than one, then he uses a bag of seashells or some dice as a randomization method to obey the probability or a spinner wheel or some mechanism of that sort. And then he repeats from step one. He spends a week on either he moves or he doesn't. If he stays, he spends another week on island four, the people of island four rejoice, or he moves to island five. In this example, and the people of island five rejoice, he spends a week there. Either way, at the end of the next week, he repeats it all over again and starts over at step one and flips a coin. It turns out, as weird as this procedure sounds, it ensures that in the long run, King Markov will visit each of the islands in proportion to its population size. But only in the long run, in the short run, of course, he could stay on island four for ten weeks in a row. It's not likely, but it could happen. Let's take that algorithm and animate it so you can get a sense about how it functions. So what I've drawn on this slide on the left is the archipelago represented as a ring so that you can loop from the smallest island, which is labeled with the number one. These numbers are relative to the population sizes of the islands and they're drawn so their areas are proportional to the numbers on them. And the largest one with the red dot on it, that's where King Markov is going to start in the simulation, that's the tenth island. And on the right, we have a histogram that is going to count up the numbers of visits he has made to each of these islands starting from now. And you'll see that he makes proposals and he leaps across on his little royal boat, or he stays, and then we count up weeks of visits on the right and you'll see that the bars grow. And at first, it's just a jagged mess. There's no real pattern because it's incredibly random. There's no long-range plan here, remember. King Markov is not planning for the future. He's just deciding every week whether he stays or moves and he does so with a random device. But quickly, you start to see that he spends more time on big islands because jumps to bigger islands are more likely than jumps to smaller islands. The jumps to smaller islands are possible. So in the very long run, after several hundred visits, you'll see that the pattern is well established and he tends to visit each of the islands in direct proportion to its relative population size relative to the other islands in the archipelago. This algorithm, of course, it doesn't really have anything to do with oceanic monarchs. This is an example of Markov chain Monte Carlo. Our usual use of it in scientific computing is to draw samples from a posterior distribution. Really any distribution, but we're going to use it in this class for posterior distributions. The islands, in the example, are the parameter values, the explanations of the data that we've been interested in up to this point in the course. And the population sizes are the posterior probabilities of each of those parameter values, each of those explanations. The relative numbers of ways that that parameter value or island could explain the data. And what the algorithm that I just showed you does is it guarantees that in the long run that the computer will visit each parameter value in proportion to its posterior probability. And all you need to be able to do computationally is compute the posterior probability for any specific combination of the parameter values one at a time. So you don't have to evaluate, like in grid approximation, the entire grid. You just do the proposal and your current location. So it's just two posterior probabilities, two little spots in the grid. And that's what you need to know if you move. And at every step, every proposal step, again, there's just two numbers to compute. This algorithm scales up to more dimensions. In the island's example, there's just one dimension. That is, there's just the location, the island that the King Markov is on. But of course, we often have many parameters in our models. And this algorithm can handle that as well. It's called a Markov chain because it's a temporal sequence. That is, the values are sampled in sequence. And the next sample or next proposal only depends upon the most recent one. So as I'm showing here on this slide, you can represent it as a migration through time from the left to the right with increasing samples. This is the King Markov on the far right there wandering around. And the path behind him is the Markov chain, so to speak. The chain of visits that he's been at. You can see he tends to spend more time near the top of this plot because those are the populous islands. But he does wander to the islands with smaller populations. And he does so in the long run in proportion to their population size. So this is a key feature of the Markov chain is that what makes something a Markov chain is that history doesn't matter. Where you're going to go next only depends upon where you are now. And that's why King Markov doesn't need records or any planning or any calendar. And that's what he likes about it. The Monte Carlo part of Markov chain Monte Carlo just means some numerical algorithm that uses randomization to perform a calculation. And there are lots of Monte Carlo techniques in the sciences. Markov chain Monte Carlo is only one of them. So here I've zoomed out on that same chain. I've just sense now of how the visits aggregate in the more populous islands. You can see it a little better. This particular algorithm that's animating at the top here is known as the metropolis algorithm. And it's a very simple version of Markov chain Monte Carlo. And or MCMC as it's usually abbreviated. In fact, it's the original one. And I'll talk more about that in a moment. It's very easy to write it. So in the book, I have a small example. You can do it in half a dozen lines of our code or any other scripting language. And it's very general. You can apply lots of problems to it. However, it's often inefficient, especially as the dimensionality of the problem grows. It can be quite hard to tune it and make it work. This metropolis algorithm comes from this paper shown here. It's a very famous paper in the computational sciences. The Metropolis Rosenbruth, Rosenbluth Teller and Teller in 1953 with the very exciting title, Equation of States Calculations by Fast Computing Machines. This is one of those papers that started a whole new field. Actually, not just a field, but a bunch of solutions that span individual scientific fields. It's been cited over 47,000 times by the latest count. This is an incredible piece of work. And the bulk of the computational work, I think nearly all of it was done by one person. One of the authors, Ariana Rosenbluth, who's pictured in the upper right of this slide. She's also the person lunging there in the foreground on the left, the fencer. Ariana Rosenbluth was an Olympic fencer, qualified for the Olympics, but World War II prevented her from competing. And then she went on to get a PhD in physics from Harvard and then to work on the Manhattan Project in Los Alamos, where she programmed machines to do all manner of simulations, including the first metropolis algorithm simulation, the first scientific application of Markov chain Monte Carlo. And a lot goes into this. Now, lots of scientific computing students learn how to write these algorithms, but to invent one is a whole different business, and to make it work with the computer hardware at the time is indeed quite an achievement. So this is the computer that Ariana Rosenbluth had to work with. This thing was called Maniac. I'll tell you why it's called that in a moment. Maniac is a huge number of transistors and vacuum tubes and other things. And by modern standards, a very feeble computer, but at the time it was one of the most capable ones. And at Los Alamos, they were putting it to use for lots of different purposes. Nick Metropolis, who gave his name to the Metropolis algorithm, is pictured here in the foreground on the left, smoking the cigarette. And I'm not sure if he called the computer Maniac or someone else did, but it's a bit of an inside joke, obviously. It stands for mathematical analyzer, numerical integrator, and computer. And the idea was that scientific acronyms had already become absurd at the time, and so they were playing with the game as a bit of a pun. Maniac weighed 1,000 pounds, had only 5 kilobytes of memory, and could do 70,000 multiplications per second. Now, 5 kilobytes of memory and 70,000 multiplications per second may sound like a lot if you're a person, because you may not have direct short-term access to 5 kilobytes of memories, and 70,000 multiplications per second is not something you could consciously do. Unconsciously, perhaps, but not consciously. But in contrast to your laptop, even a cheap laptop that's 5 to 7 years old, will only weigh 4 to 7 pounds, have more than 8 million kilobytes of memory, and be capable of billions of multiplications per second. Maniac was capable at the time, but it was very difficult hardware to program in. The memory was cramped, the operations were limited, and most of the programming tools we take for granted now, high-level languages, compilers, debuggers, simply didn't exist. So Ariana Rosenbluth had a really incredible task here. Before I talk about the details of these algorithms, or rather show you heuristically how they function, I want to give you an idea about how big this topic is now. Markov Chain Monte Carlo is a huge field, and there are people that's been their whole careers just working on these algorithms, inventing new ones, inventing diagnostics for them, finding new ways to make them efficient. And so the sort of plain metropolis algorithm that Rosenbluth invented, or rather found a computerized algorithm for, mathematicians came up with the basic strategy, isn't of much use anymore because we have much more efficient algorithms to use, but that's not to disrespect it at all because without the metropolis algorithm, or maybe we should call it the Rosenbluth algorithm, we wouldn't have the new ones either. There have been a lot of innovations, especially in the last two decades, and all of the new best methods for Markov Chain Monte Carlo use gradients. Gradients, I'll say more about this later, but the gradient has to do with the curvature of the distribution you're sampling from the local curvature at each point. I'll say why that helps as we go forward. The gradient method we're going to talk about in using this course is called Hamiltonian Monte Carlo, and the Hamiltonian is a term for physics that has to do with the energy in a system. And it has to do, I'll say why in a bit, but let's still talk about the Rosenbluth algorithm, usually known as the metropolis algorithm. The way it works, as you saw with King Markov, was to make proposals and then consider that proposal in relation to your current position. So in continuous space, how does that work? I'm showing you on this slide is a typical normal distribution estimation problem where we've got the mean of a normal distribution on the horizontal axis and the standard deviation sigma on the vertical. But we're showing the logarithm of the standard deviation so that it's continuous, so there's no boundary at zero. So the negative numbers are below one. If you exponentiated them, they'd be above zero. So what you're seeing with the blue, I should say, the rings on this slide represent the shape of the posterior distribution that we want to draw samples from. If we knew it, it would have this shape, it would look like a hill, or rather I want you to think of this thing as a bowl where in the center of this slide is the region of highest posterior probability that's down in a valley, things tend to sink into it. And then as you move towards the edge and the contour lines get closer together, those are ridges that are rising up and those are regions of very low posterior probability. And we don't want to visit those places very often because they're like islands with small population sizes. Then the blue line you see there is the first proposal and this chain is going to animate and what you'll see is that if there's a red dot, that's a rejected move through the randomization device and otherwise it will move and there'll be a new black dot. And then those points are the samples that we collect. And if we pile up all those samples, we get the posterior distribution. If we get enough of them, we get an approximation of the posterior distribution. So you can see this thing about these algorithms is that they only act locally, they don't know the posterior distribution. That's why we need these methods because we don't know the posterior distribution. We can't sample from it directly, but we can blindly stumble around it miraculously and draw proper samples from it. And that's what these Markov chain Monte Carlo algorithms allow us to do. And they're often a most efficient solution for directly sampling from arbitrary probability densities. One of the things about the Rosenbluth algorithm that is frustrating is that it has to be tuned. So the size of the proposal, that is how far from your current location you consider moving has a big effect on how efficient the algorithm is. And basically you lose at one extreme or the other. So you have to tune these things in general applications. So on the left I'm showing you the animation from the previous slide and what I'm calling the large step size version of the simulation. You'll see that the blue bars are often quite large. This simulation is willing to consider points very far from the current location. As a consequence, lots of them are red, meaning that they're rejected. And then the changes stay still for many steps in a time as it is right there. And that wastes a lot of computation. What you'd really like is for every proposal to be accepted to always propose points that are valid in the posterior distribution and to always move there. And that's what you'll be able to do with this algorithm. On the right you've got the other extreme small step sizes only considering very local proposals. This is like King Marco's algorithm where he only considered neighboring islands. When you do it this way, you only very rarely reject or much more rarely reject moving. It still happens. So you take more samples but the samples are really close to one another and as a consequence each additional sample doesn't add a lot of information about the shape of the distribution. In contrast on the left, when you do manage to move you learn a lot more because the points are less similar to one another after a move. So all computational methods have trade-offs like that so there's nothing special about that. But it turns out we can do a lot better and that's what these modern gradient methods do. So let's consider an analogy before I show you what Hamiltonian Monte Carlo looks like. I want you to think about skateboarding and skate parks. So here like this bowl is what I would call this a big bowl and the very talented young man there who is skating there's a sense in which of course his skateboard doesn't know anything about the shape of that bowl and yet it rolls effortlessly around it and because the potential energy as he rolls around this thing is highest on the edges it will naturally turn him around under the force of gravity and he will tend to spend more time near the bottom and so this is like the high probability region of the posterior distribution whereas the edge is in the low probability region because he can't stay there as a moving skateboardist as it were. So what if we could use a strategy like this to sample from arbitrary distributions represent them like bowls they don't have to be perfectly Gaussian bowls they can have odd shapes like the one you see on your slide and then we put virtual skateboarders to thrash around in there we flick them through the bowl and let them do a couple of tricks and as we track their positions their positions over time we'll tend to take samples from the shape of this bowl because they spend more time near the bottom than near the top and it turns out you can do this and this is what the algorithms I'm going to show you next actually do and it seems weird that we can do this but we really are we're going to run skateboard simulations or frictionless skateboarder simulations inside the computer so we can take samples. Okay, how does this actually end up looking? Now I've got my virtual skateboarders lined up on the edge of the bowl there there's that little blue tick he's about to start rolling and it's the same Gaussian posterior problem as I had before with the mean of a normal distribution to be estimated on the horizontal and it's log standard deviation on the vertical. Now what's going to happen here remember is the middle of this contour is the center of the bowl, it's the bottom and then out on the edges we've got the ridges, the edge of this thing and so we're just going to start a physics simulation literally and it's going to be a frictionless skateboard and it's going to roll following we're going to flick it with momentum in a particular direction a random direction in fact and then we're going to let physics do the rest and we're going to let it run for a little while as you see here and it follows the contour and rolls back down and then we stop it after some particular length of path and we record the position and that's what those black dots are and then we flick the skateboarder again in a random direction we let the physics simulation go, turns around, record the position and every one of these trajectories they're called results in a sample from the posterior distribution now you have to be careful exactly how you do these calculations so that it's a proper Markov chain but it's not that complicated to do it and the mathematicians have figured out the details for us to make it work this is a fantastically very useful sort of thing but it requires the gradients as I said the additional thing you need to know here is the local curvature that is you need to know what the skateboard knows you need to know whether you're going uphill or downhill in any particular local position so that you can run the simulation because this is not a metaphor that I'm running a physics simulation here this really is a physics simulation, a little frictionless particle rolling around on the posterior distribution and the samples just come from its position at fixed points in time so these methods also require some tuning and you can have really long trajectories like on the left and let the skateboard roll for a long time and since it's frictionless it'll roll forever actually it'll never stop this is not always the most efficient thing to do because often it'll loop back and you get this phenomenon I talk about more in the book called the U-turn phenomenon where the trajectory turns around and so you end up with samples that are really similar to one another because they end up next to one another as you see on the left and that's not optimal now in the long run it'll be fine and you will take valid samples from the posterior distribution and explore it, it's just less efficient on the right we've got the other extreme we have really short steps you see the little increments in the blue worm there that's called step size and when the trajectories are short and have small steps exploration is slower but in a few jumps you can still traverse the whole shape of the thing and get valid samples it just may take longer to explore the whole shape of the distribution modern engines as I'll talk about in a little bit figure out these parameters for you they have an adaptation algorithm in them to try and figure out the right step size and length of trajectory that is how fast and how long the skateboarder should go so that you can efficiently explore the posterior distribution now of course the distribution on this slide is not hard to sample from and the basic Rosenbluth or Metropolis algorithm will have no problem sampling from it either we don't really need Hamiltonian Monte Carlo for this distribution it's a bit of overkill although it's fun to watch where Hamiltonian Monte Carlo really shines and becomes essential is in more complicated posterior distributions and the thing about with many dimensions and the thing about more complicated posterior distributions which arise even in modestly more complicated models than the ones we've used so far in the course often have very high correlations in them between the parameters so let me show you a really simple example of that this is still only a two-dimensional example because who can see 3D or 4D on a slide right so we'll stare at this and even though it's only two-dimensionals imagine it's very high-dimensional and these correlations exist between a bunch of parameters simultaneously what your Markov chain sampler has to do is move along some hyper-dimensional thin ridge like the oval on your screen here these shows again the contours of the posterior distribution and the oval in the center is the high probability region that's where we want to spend our time that's where the skateboarder will spend his time and the more closely packed ridges are the low probability regions where gravity forces you back towards the middle and we can... Metropolis has a really hard time with distributions like this because it tends to... since this is such a narrow shape in the middle Metropolis will tend to make proposals in the bad part outside the ridges and almost all of them will be rejected and so it's hard for Metropolis to move around a space like this because it doesn't know anything about the global shape the thing about the skateboard that is really cool even though the skateboard doesn't know anything about the global shape either the fact that it senses the curvature where it is it knows which direction to roll means it effectively explores the global shape quite efficiently that is it won't roll uphill and that means it makes good proposals so let me show you what that looks like in animation here's our skateboarder, we started him on the edge now he rolls around in the middle and takes some samples and you'll see that even though at any point in time the simulation only knows the local curvature it's not calculating the whole posterior distribution it doesn't know it, it has to get that from the samples it still manages to find the shape of the oval in the middle of the slide and take samples from it and make these zigzaggy snake patterns and all the algorithms... there are other algorithms that also use gradients and they can be as efficient as Hamiltonian Monte Carlo and that's because what they all share is this ability to sort of find the global curvature by considering local curvature is eventually find global curvature by considering local curvature they're not magic and so yes, with enough samples as you see here you can get the whole shape pretty rapidly okay, that's Hamiltonian Monte Carlo if you're interested in the actual details of the algorithm you want to see what it looks like in code on pages 276 to 278 of my book I show you how to do this with basic r-scripting you can write the simulations I just showed you with basic r-scripting it's not that complicated and I take you through all the justifications for the different parts and the key thing about this the physics simulation part is actually pretty easy it's just a loop that calculates velocities and updates the position of a particle it's not, maybe off the top of your head you wouldn't know how to write that but once you've seen it you'll agree it's just a little loop with some updates of positions the tricky part is that for every model you write and want to analyze this way you have to calculate the gradient somehow you need the derivative for every parameter of the log posterior probability and for big models that means a lot of derivatives and it means a lot of calculations and a lot of custing programming for every problem we're not going to have to do that but if you're interested in that I explain more on these three pages in the book okay, so gradients, yes keep talking about gradients and we need them here's another video to give an impression to you about how the gradient governs the motion right so here's the skateboard's view as it were of moving along a half pipe a particularly deadly half pipe with this bizarre trap in it and the skateboard doesn't know anything about the global shape but it can find it and all it needs in any particular point is its curvature and so that's what we need to calculate for any arbitrary half pipe structure so how do we do that? how do we maintain our convenience of writing general statistical models and calculate the curvature of Tony Hawk's half pipe so we need gradients and as I said you can write them yourself you could go and measure any particular half pipe and get the curvature at every point and you could represent that knowledge with giant tables that you could look up as you do the simulation but that's not very efficient or you could have equations if you can describe the shape of the posterior distribution with general equations the general problem is that that means we have to recode the Markov chain for every model you don't want to do that, you don't want to write them yourself instead there's this fantastic scientific computing technique called AutoDiff which stands for automatic differentiation and what AutoDiff does is it takes your computer code and it computes symbolic derivatives that is the gradients from your model code that is to say when you write a statistical model just like the ones we've been writing in this course up to this point the writing is an expression, an analytical expression for the posterior probability and that defines the distribution we want to sample from that is you can plug in fixed values for all the parameters and you can get the relative numbers of ways that the data could arise from that combination of parameter values that's the posterior probability of course we want to evaluate it evaluate the posterior distribution for all combinations of parameter values and that's a tall order if you had the derivatives of that expression for the posterior probability you'd know which way was uphill and downhill from any particular combination of parameter values you had and in mathematics when we take a set of derivatives like that from a function you get a matrix that's called a Jacobian shown on the right hand side here and it's just a matrix of derivatives for some particular function here capital B for some number of input variables in our example these would be parameters you take the derivative with respect to that parameter and this gives you information about whether it's uphill or downhill for any particular parameter if you increase that parameter with the probability go up if you decrease it would it go down an auto-diff does this for you that's why it's called automatic differentiation it can do this because taking derivatives is relatively speaking easy there are rules for it and your computer is very good at following them still the computer science that goes into these algorithms is really amazing and we're lucky that people have written them for us and we can just use the libraries that make it work automatic differentiation is not unique to Hamiltonian Monte Carlo or even Markov chain Monte Carlo it's used in a lot of machine learning approaches for any kind of high dimensional machine learning that goes on these days which is most of it let's be honest you need gradients to search the space because there's too many parameters and the parameter space is just too big to do any kind of brute force thing and so for example some of you will have heard of back propagation which is used a lot in neural networks back propagation is a special case of automatic differentiation okay so I said you're not going to have to do any of that yourself that's the good news instead we're going to use the Stan math library to do the automatic differentiation for us you'll be able to continue writing the model definitions as you have been so far and Stan will compute the gradients for you and this makes things really nice Stan is a very professional open source scientific computing project that is used both in the sciences and in industry quite a lot now and Stan is not an acronym it doesn't stand for anything Stan is a man or was a man it's named after Stanisław I think that's pronounced it's a Polish name Stanisław Ulam who lived between 1909 and 1984 and Stan is credited as the father of Monte Carlo Methods he's a brilliant mathematician he fled Europe during the Second World War like many brilliant European mathematicians and went on to make very important contributions in many areas of mathematics, physics, and biology really an extraordinary career the Stan math library is named after him because of Stanisław's contributions that paved the way for later innovations in Monte Carlo Methods oh yeah, here's Stanisław with his daughter Claire when she was quite young at Maniac because they all worked at Los Alamos and Stanisław did analytical mathematics to design the calculations and then Ariana Rosenbluth and others programmed the computers to actually execute them okay, that's a lot and it's mostly conceptual work so far but nevertheless I think we should pause and as always it'd be a good idea to take a look back through the slides so far see if there's anything that remains confusing take a couple notes then go for a walk, have a cup of tea or coffee and when you come back, I'll be here welcome back let's pick up by actually doing some work now so what I've tried to do so far in this lecture is give you a conceptual heuristic and a little bit of a historical introduction both Markov chain Monte Carlo and especially Hamiltonian Monte Carlo now I want to show you more of the workflow that we do with this in a typical applied Bayesian data analysis scenario so let's revisit an example we already know so I don't have to introduce a new data set let's think of the divorce data again from last week all I'm doing on this slide is trying to remind you of the model we used, one of the models we used when we analyzed that data set this is a linear regression I'll show you the mathematical model notation on the right of the slide, the code on the left it's a linear regression with two predictors that is we're stratifying divorce rates simultaneously by marriage rate and agent marriage that's the D for divorce rate M for marriage rate and A for agent marriage no surprises I hope by this point and the code on this slide is the standard quadratic approximation code you would have used before all that's different about it is I've defined the formula list as a separate object and then passed it into the quap call at the bottom of the code block there to fit this same model with Hamiltonian Monte Carlo if you're using my rethinking package all you need to do is pass the same formula list to the ULAM function shown at the bottom and what ULAM will do is it will actually scan code and it will pass the whole job over to the Stan Math Library let it compute the gradients, run the physics simulation, collect the samples and then ULAM will return those samples to you and you can analyze them just like the samples from quap that is all the same convenience functions that you've been using so far to analyze quap models like link and extract samples will also work for models you fit with ULAM all the machinery is different under the hood but everything you've learned so far about how to analyze posterior distributions is still true and you've already learned how to work with samples, in this case this is all we get, all we get are samples there is no analytical representation of the posterior distribution when you do Markov chain Monte Carlo but now you're already a pro at that and you're ready to go so you run this, I encourage you to run this, make sure you get everything installed right and see if you can reproduce the results that I'm showing you here, this is the precy output for the divorce model fit with Hamiltonian Monte Carlo and the means and standard deviations and compatibility intervals are practically identical, this is a standard linear regression we don't need Hamiltonian Monte Carlo for a problem like this but this is where we start so that you understand that it's giving the same results in principle and as we scale up to bigger kinds of models, quadratic approximation simply won't work at all and we're going to have to do it with Hamiltonian Monte Carlo what's different in the precy output I'm sure you've already noticed are these two new columns on the right the NF column you can say that NEF if you like I tend to say NF and R hat for I'm going to explain what those are in a bit let's just hang on a second but what they're about is diagnosing how well the simulation has functioned how well the skater performed as it were before I do that though before we talk about the workflow and diagnosing the machine's performance I want to talk about what this model would look like in pure stand code so ULAM is a convenience function so that you can keep working with those standard kind of mass stats model formulas but what's really going under the hood scientific computing requires more input than that because the mathematics is ambiguous in those formulas there's a lot of intuitive knowledge that you bring to the context but the computer needs to be told that because it doesn't know it now if you're not interested so what I'm going to do in the next couple slides is show you what the stand code is like and give you some first meeting with stand just so you get it so it's less mysterious if you're not interested in working with stand right now you can free to skip this section and keep going you can check the chapter bookmarks on the YouTube progress bar on the time progress bar at the bottom of the video okay so stand code is worth learning on its own eventually I think about my rethinking package as just a scaffold to get you into using stand or some other similar programming language like tour in the Julia environment and the reason is because these probabilistic programming languages like stand or touring are incredibly expressive and you can do a lot more with them if you get comfortable with programming in them directly they're also very portable in a way that our code is not of course ours portable on lots of devices you can probably run it on your watch now but the same stand model can be compiled in any scripting language because stand doesn't depend upon any particular scripting languages and that means that you can share it with lots of colleagues in high performance computing on the right I'm showing you the stand code for the force model and you can always get oolong to spit out the equivalent stand code by typing the command at the top here that I've commented out the stand code function and then the name of the fit model project and then it will display the stand code that was actually run let me explain each of the parts of this for you and I'm not going to go into intricate detail here I just want to give you an idea of the general personality of each of these blocks so the top part of every stand model is the data block and this is where we declare observed variables it's things usually that we call data but they're observed variables and these are the data analysis so these are the divorce rates the ages at marriage and the marriage rates and you'll notice that the names are the same here D, A and M but they're all prepended by this word vector and then in brackets 50 50 is the link there are 50 observations because there are 50 states in the United States and vector is the kind of data it tells stand how the variable is represented in memory and this is if you haven't programmed in languages like C that require this this seems really strange because R and Python don't normally require you to say anything about the representation of the variable in memory maybe once in a while you've coerced something to be an integer so you've gotten a slight whiff of this and what it's about this is essential though in high performance computing for code that's going to be compiled down in machine code you have to tell the computer unambiguously what the memory representation is going to be the reason is because this allows stand to catch errors in your coding that is you try to do something that's not allowed with a particular type and that's really indispensable because it reduces the amount of debugging you need to do basically it creates internal consistency checks of the code it's a way of proofing the code as you write it and the second block are the parameters these are the unobserved variables in the model in a fundamental sense in Bayes parameters are just unobserved variables unobserved data will push that metaphor to its limits in a later week and the same thing holds here stand needs to know the types of each of these so it knows which checks to run and which constraints exist so notice for example that sigma here has a constraint lower equals 0 it must be a positive real value finally there's the model block this is the block that's most similar to the code you've been writing so far it restates the distributional assumptions of the model the squiggles we've been putting into our model formulas up to this point but you can put a lot more kinds of code inside of a stand model block than you can in a quap or oolong model although you can do a lot in those as well in very large models complicated models like latent variable models the model block can become very complicated you can analyze multiple data sets at the same time have complicated simultaneous equation systems ordinary differential equations large numbers of things we're going to keep doing regression for a little while with this and we'll gradually build up the complexity there'll be no sudden transition okay to run the stand model what do you do well you take that stand code and you save it as its own text file and this is a hygienic way to do scientific computing you make each stand model its own independent text file so you can manage the complexity of your project and then if you're using rethinking you can just pass this the name of this file to the C stand command C stand for command stand and it will sample just like oolong does because this is what oolong does it builds except oolong builds the stand code for you and then it passes it to command stand and then returns the results to you so you're just skipping the middleman here and hopefully understanding the underlying complexity a little better and then you can work with samples from pure stand models just like you did the others because it's the same output no matter what interface you use if you're using markov chain Monte Carlo if you know the model once you get the samples you can compute anything you need okay let's continue with thinking about the workflow here and in particular I want to emphasize sort of hygiene responsibilities that we have when working at scientific computing algorithms like markov chain Monte Carlo these algorithms are seemingly magical when they work but they're not magical they work because people have spent a lot of time making them work and thinking about ways to diagnose when they don't work so I want to spend a little bit of time introducing you to the diagnostics that we use in markov chain Monte Carlo this is really an essential part of the workflow it's in the same sense that the other things are like stating your estimate and justifying the statistical strategy those are general responsibilities when you use markov chain Monte Carlo you also have a responsibility to check that the algorithm functioned correctly and also to provide the information so your colleagues can trust that it functioned correctly just like all the other parts of your analysis and I'm going to go through five different things just to introduce them to you and as you do this work yourself you'll quickly become comfortable with each of these forms of diagnostics and understand the relationships among them the first one is called trace plots and in a sense this is the graphical display that people most associate with markov chain Monte Carlo in the trace plot all we do is take the sequential samples from the markov chain for each parameter and we plot them as a timeline so here for the divorce model I've taken the intercept parameter alpha and I've just plotted the sequential samples we took a thousand samples from the markov chain starting on the left and going all the way to the right and the gray region is a region that's called the warm up by Stan and the warm up is when Stan is figuring out step size and how long the trajectory should be I say more about this in the book and you can read a lot more about it in the Stan manual if you like we don't usually use the warm up for inference it's not necessarily a valid part of the posterior distribution but it's nice to show it in the trace plot so you can see how the warm up phase proceeded outside of the warm up phase on the right half of this graph those are the 500 samples we have for inference which in this case is plenty you'll see that there's an nf equals 321 displayed in the upper right I'll talk about that in a minute let's turn our gaze away from that for now we don't just look at one parameter though we look at all the parameters we want to ensure that for every parameter that the trace plot looks healthy and what does that mean well that means that it's stationary it tends to stay in the same part of the distribution as these four do you'll see that there's zigzagging around within the same area they don't drift up and down to a large extent as the simulation goes on they're exploring the lower part of the half pipe as it were sometimes people say that it's the hairy caterpillar test you want the trace plots to look like hairy caterpillars if that's something that you're familiar with so one chain is not enough what I just showed you is the trace plot for only one Markov chain and typically we will run more than one Markov chain simultaneously and we do that because the strong test that the chain is healthy is that multiple chains converge to the same distribution this is a criterion we call convergence you want each chain to explore the correct distribution and stay there when it finds it to be stationary and you want every chain to explore the same distribution that is to be stationary in the same place and that means starting multiple chains from independent different starting locations often quite dispersed starting locations and ensure that they converge to the same distribution this is easy to do in Ulam you just add the chains argument four is a good number most computers have lots of cores these days and you can run a different chain on every core in your computer and if you want it to run them all simultaneously be sure to add the cores command as well so here I'm running four chains on four cores and they'll all run simultaneously and this model runs very fast try it for yourself and see and now I'm going to layer them in this is this is what we showed before it's it's the first chain from this model and then the second one it converges very quickly to the same place and stays in the same place and then the third one and then the fourth one and this is what you want to see now this is not a guarantee that the chain function that the Markov chain function correctly but if the Markov chain functions correctly this is what you should see trace plots are pretty limited though one of the things about them is that they the chains overlap the fuzzy caterpillar is too complicated and so there could be a chain in there that's misbehaving and you can't see it because it's occluded by the other parts of the caterpillar if that makes any sense so a more recent graphical innovation that I like a lot more are these normalized trace rank plots I call them trunk plots no one else does but I'm going to keep trying to make fetch happen and trunk plots well first of all they're just gorgeous I hope you agree to the one on this slide what are they though so again they're trace plots so the horizontal axis is sequential samples from the far left is the start of the simulation and then on the right the end and this is just the post warm-up sample so I'm not showing the gray region for the trunk plot and then instead of the vertical axis representing the actual parameter value we've ordered all the values to ranks relative order rank orders across the chains and why would we do this well because it makes a quick heuristic way well I should say it's good for the calculations of the later criteria like our hat and F that we're going to talk about in a bit but it makes a great visual heuristic way to quickly assess whether any chain is consistently above or below the others I'll say that again it creates a quick heuristic visual way to assess if any one of the chains is consistently above or below the others what you want to see is what I'm showing here which is that the colors zigzag in and out of one another in this mosaic fashion so the which chain is on top is constantly switching and that means that the chains are exploring the same space I'll show you what a bad one looks like in a few slides so hang on and you can see for the other parameters the same sort of pattern this is a nice healthy chain you get this mosaic switching of all the different chain colors as that because no chain is consistently occupying a higher space and parameter values than the others they quickly switch and dodge and weave among one another okay it's time to talk about these these two numerical diagnostics that are displayed in the pracey output and F and our hat let's start with our hat our hat is quite common and maybe you've already heard of it when chains converge we want them to converge in two ways first we want to check that the start and end of each chain is exploring the same region that means they're stationary the chains are stationary the second thing is we want to be sure that independent chains explore the same region so we want the chain to converge with itself that is it's starting it's in to be in the same place and we want different chains to converge to the same location as well independent chains so our hat is a statistical criterion for assessing this but it's not a test and it provides no guarantees nevertheless it's really useful what is it it's a ratio of variances so as the total variance among all the chains for some particular parameter uh shrinks to be the same as the average variance within chains then our hat approaches one that's just the way our hat is because it's it's a ratio of these two things the total variance over the variance within chains so what that one way to think about this is I try to help you with this plot on the right I've shown you for the the chains that came out of this divorce model samples on the horizontal and then variance on the vertical and I've computed in red the between chain variance what does that mean it means the variance among the means of the chains the mean parameter values of the different chains and then the blue trend is the the average variance within chains so we compute the variance of each chain and then we average those variances that's the blue trend so what you see is that both of these decline as the chain converges right so initially they're very high because the chains are in different places and they haven't settled down yet and then the variance is decline as they stabilize and become stationary but the red convert collapses much faster because these are good chains and that's what you want to see is that the variance is almost all the variance in among the chains is is within them not between them because they've converged to the same place and so this is what our head does is it tends to approach one it'll approach it from above though and so large values above one are bad and if you run the chain long enough you want to see it shrink down to one or just above one like 1.01 or 1.02 but there's no threshold this is not a test it's not a significance test or anything like that it's a diagnostic criterion to help you as a warning signal our can equal one and chains can still be bad but every little bit of information helps us nf is the last one of these criteria this is the often called the effective number of samples so what does this mean one way to think about this is that nf is the answer to the question how long would a chain be if each sample from that chain was independent of the one just before it this is a weird question what kind of maniac would ask a question like this it's actually a really important question because Markov chains are sequential well they're Markov chains they're sequential the next parameter sample depends only on the most recent one and what can happen as a consequence if the chain is not exploring the posterior distribution efficiently is that the sequential samples will be correlated with one another and we usually call this autocorrelation that is self-correlation so what I've shown you in the graph in the lower right of this slide is something called the autocorrelation function for the chain Markov chain for the intercept in the divorce model and the way to read these autocorrelation function graphs is that on the far left at zero lag the horizontal axis is lag you've got any location in the chain this is like zero means self and then the vertical axis is correlation and so you'll see that the red bar there goes all the way up to one that means the sample is 100% correlated with itself yeah well thank you but the next sample the one at lag one right next to it to the right is as a correlation about 0.5 and that means that each sequential sample is 50% correlated and then that correlation declines as we move away so that by the time we're five samples away there's essentially no correlation between the parameter values that are sampled this could be better often Stan will do even better than this and the autocorrelation will drop to almost zero after one sample but Markov chains often show some autocorrelation of this kind but you don't want it to go too far out why it's not doom it doesn't mean the chains bad after all the metropolis algorithm generates really highly autocorrelated samples but it still works it doesn't mean anything it's not working just means it's less efficient and that means you have to sample longer so NF is the answer to if we could imagine a chain and we can where there was no autocorrelation at all so the autocorrelation function had a spike at zero and then it was zero everywhere else had a spike to one at the zero lag and then lag one was zero correlation and so on NF tells you how long that chain would be so this is why it's called the effective number of samples and the effective number of samples is nearly always not always but nearly always smaller than the actual number of samples you took because your chain has autocorrelation and that means sequential samples share information so they overlap in the information they provide if they were totally uncorrelated they provide more information and so you'll see in the preci output in the upper right the NF so for this particular model we drew 2,000 samples there are 4 chains and we're getting 500 samples from each so that's 2,000 samples that we get to work with but all the NF values are less than 2,000 still you don't need a lot of effective samples to get a good estimate of a posterior mean or even higher moments some of the outer quartiles so these are good NF's how big should NF be that depends upon your problem I can't give you a general answer to that but I say some more about it in the book okay last one to talk about in our hygienic workflow is this thing called divergent transitions and I'm just going to say a little bit about them today and we'll talk a lot more about them in a future lecture the animation I'm showing you here is our skate park example again with a complicated shape but I've changed the trajectories so that they're very jagged so that I can show you what it looks like when Hamiltonian Monte Carlo rejects a proposal so in the simulation you'll see this red point appears right there and then when the simulation continues after it it doesn't continue from the red point but from where it started the previous time so watch it see it starts over from where it was before that red point is the end of a divergent transition it's a rejected proposal what is a divergent transition well so you learned that metropolis the metropolis algorithm rejects a lot of proposals sometimes most of them unfortunately and that makes it inefficient one of the reasons Hamiltonian Monte Carlo is so efficient is that it it can accept let's say it makes really good proposals because it follows the contours the gradient of the posterior probability to find good proposals but it still occasionally makes a proposal that it won't accept and these happen when the simulation malfunctions in a sense it diverges from the true path so what does that mean so there's a smooth skate park distribution that we're actually trying to sample from but we can't see it and the simulation we're running is actually discrete that's what those little bar segments are in the path and I've made them really big here to exaggerate the point typically we'd make them smaller why because when the line segments are smaller we get a smoother approximation of the posterior distribution of its curvature we can roll more smoothly on the half pipe but if the half pipe is very steep in some particular location which means of course the posterior distribution has a place where it changes very rapidly in posterior probability then the simulation might have time turning that fast and it could crash or break through the wall in a sense inside the computer and nothing bad is going to happen there except that the algorithm will detect that it has a way to detect this and because this is a physical system a simulation of a physical system rather and in physical systems the total energy is constant there's no friction in the system nothing's lost to heat and so if the energy at the start of the path is different from the energy at the end it's a divergent transition and then those reject those proposals are rejected and they're called divergent transitions in Hamiltonian Monte Carlo algorithms so having rejected proposals is not a bad thing like I said metropolis will often reject most proposals but if you've got a lot of divergent transitions it could be that the algorithm is exploring the posterior distribution very poorly and it's inefficient and it could result in bias if it's always the same region of the posterior where the divergent transitions are happening then you're not taking samples from that region and that may not be very good at all so that's all I'm going to say about divergent transitions for now but in a later lecture when we start talking about multi-level models I'm going to come back to this and I'm going to show you ways to program around divergent transitions because they're a fact of life but we live with them and they're not poison they don't destroy your project but we should try to get rid of them if we can so the machine is more efficient and we avoid possible bias okay before I end this lecture I want to show you some examples of badly behaving chains not just good ones right it's if you can know what to recognize that'll make you a lot more comfortable I hope so here's an example from chapter 9 of the book model 9.2 it's a very simple example two Gaussian observations minus one in one and we're just going to estimate the mean and standard deviation of the distribution these points come from now if your reflex is to complain wait it's only two data points you can't do that remember this is base we can do that the minimum sample size for base is zero right we can simulate from the prior even if we have no data and we can update the posterior distribution with one data point here we have two this is like luxury but I've what's unusual in this model that I've never done before is I've used extremely flat priors so the alpha the mean here the intercept has a Gaussian prior with mean zero and a standard deviation of a thousand so that's that's a huge variance and this is an extremely flat normal distribution normal distribution relative to the scale of the data and then similarly for the standard deviation I've given an exponential distribution with a rate to point zero zero zero one which means it declines very very slowly so it's very very flat out to very high values we run this model using Hamiltonian Monte Carlo and Ulam and the precy output is monstrous now you know what the mean should be it should be about zero right what's the average of minus one and one you can you can fit this posterior distribution with your eyeballs almost that you look at this the the mean of alpha is is about minus 30 and the standard deviation is 80 this model has no idea where alpha is and likewise has no idea where sigma is as a mean of 390 and a standard deviation of about a thousand and if you look at the NF and our hat values you'll see that these are quite low actually the our hats aren't so bad so this is to give you an idea just because our hat is near one doesn't mean the chain worked but the NF's are way below the actual samples that we took if you look at the for I moved to the next slide you'll also get a warning when you run this model about divergent transitions and this is to say that the algorithm was having a hard time exploring this posterior distribution and that resulted in 108 divergent transitions and it tells you this so that you might want to try to do something about them you'll see that there's some possible remedies suggested by this warning message and the third one is using informative or weekly informative prior distributions and that's what we're going to do the problem here is that these priors basically say the whole real number line is is open game for these parameters and that's not true what did the trace plots look like I show them at the top of this slide and you'll see that these are not nice healthy chains no no hairy caterpillar illusion here chains like this maybe in a really long run will give you a reasonable reputation representation of the posterior distribution but you shouldn't trust them and no one else should trust them either and then on the bottom the same traces but shown as normalized rank plots or trank plots and you'll see that at any particular point the chains zigzag in and out of one another but only very slowly so for long stretches one particular color be it orange or green lies above all the others and that's not what you want to see okay so in these sorts of cases I like to invoke something from Andrew Gellman who's one of the gurus of Bayesian data analysis or here I call him the spider-man in Bayesian data analysis this thing called the folk theorem of statistical computing so when you have computational problems Gellman often says often it's because there's a problem with your model and by that he doesn't mean the code he means the whole concept of the model itself that is the probability assumptions that go into the model and in the example here that means the priors it's not the algorithm or the computer that's the problem here it's the assumptions that go the statistical assumptions that have defined the posterior distribution itself that are the problem you're trying to make an algorithm do something which is not a good idea and it refuses to do it why do I call Andrew Gellman one of the statistics here well you'll see the spider-man above him I think the saying goes with great power comes great responsibility and that's the thing about these algorithms we use them because they do really powerful things for us but we have this responsibility to use them for appropriate ends for appropriate goals and that means educating ourselves on what good scientific assumptions are to put into them you can't fit any arbitrary scientifically ill-formed model so that's the folk theory of most statistical computing and it applies in this case quite well so on the left of this slide I've showed you the same model that gives you bad chains and on the right I've changed it just so the priors are narrower but they're not all that narrow so now alpha has a normal prior with a mean of 1 and a standard deviation of 10 why didn't I make it a mean of 1 I'm being a bit cheeky here I just wanted to show you that the prior being wrong does not destroy the inference and a standard deviation of 10 is still a big standard deviation that's a variance of 100 this is still a very flat prior distribution but it's not a standard deviation of 1000 it's qualitatively different and sigma to have a rate of 1 so that means that the expected average standard deviation is about 1 and for standardized data of course that makes a lot of sense but still this is not a very informative prior either and when we fit this again now we get quite sensible results and we learn the mean and standard deviation of the data as they were provided the r-hats are very close to 1 1.01 and 1 exactly and the n-f's are much higher the trace plots look much better the hairy caterpillar plots that we see here sigma it has an upward a trend upward the upward tail there is longer that's very normal for scale parameters that's not a pathology that's what the posterior distribution actually looks like and then the tranq plots on the bottom row look as they should you see that zigzagging mosaic kind of pattern as the chains swap off who's on top okay we've covered a lot of ground in this lecture I want to kind of loop back to the historical picture of these methods desktop Markov chain Monte Carlo really has been a revolution in scientific computing and it's not very old it's only since the 1980s the end of the Cold War really when this desktop Markov chain Monte Carlo revolution began and it was begun by statisticians like Sir David Spiegelhalter pictured there on the right and his colleagues had this wild idea to start a project called WinBugs where the user, the scientist or the consulting statistician could specify arbitrary statistical models in the forms of probabilistic assumptions and like we've been doing in this course and then the software itself would figure out the problem of designing a Markov chain sampler for it so that you didn't have to muck around with that part of it and this radically accelerates scientific discovery and it catalyzed a bunch of later computer software development and also stimulated a bunch of really deep mathematical research into Markov chain algorithms this has been a fantastic thing and in all of this scientific computing has been in the forefront something that's not often given as much credit as it deserves because it's really enabled scientists to take charge of their research projects and no longer hostage to the sorts of algorithms that any particular statistician wants to program for them they can come to their problems with their models and they can program them directly in software like bugs, I don't think very many people use bugs anymore or stan or touring or some other kind of modern probabilistic programming language and we can work in high dimensional problems and many important scientific problems are high dimension we can do very important things like propagate measurement error we'll talk about that near the end of this course but all along of course for this to work for us as applied data analysts as scientists, as researchers we have to remember not to pipette by mouth we have to check the diagnostics both for our own peace of mind and also so our colleagues will trust our results okay that has been week 4 of statistical rethinking 2022 next week we're going to get back to models and we're going to start talking about bigger family of regressions called generalized linear models which will open up a lot of exciting avenues for getting your scientific ideas meshed with your statistical analyses and I'll see you there