Welcome to probabilistic machine learning, lecture number five. Here is where we are in the progress of the course. We began in lecture one by discovering that probability is our way to encode uncertainty in reasoning, the normative extension of propositional logic. Lecture two showed us that this process can be computationally very challenging unless we use structural prior information about the decoupling, or the conditional decoupling, of variables in the inference process once certain other variables are available. In lecture three, we saw that this notion can be extended to continuous variables, to reason about quantities that are lengths, rates, and other real-valued variables. In the last lecture, we began to think about how to actually implement probabilistic reasoning in practice. This is primarily a computational issue, and we saw that the key operation in this kind of process is integration, which shows up in various forms to compute various quantities. Usually, these quantities are a form of expectation of some function f. That function might well be the constant function, in which case we are computing a normalization constant. It could also be a likelihood, in which case the expectation is the evidence; it could be the identity function, in which case we get the mean of the distribution; or any other polynomial of x, which gives a moment of the distribution. These integrals can be very hard if the distribution is complicated, and one generic, though not necessarily perfect or best, way to compute them in practice is to draw random samples from the distribution defined by the probability density function p, if it exists, then sum evaluations of the function f at these random samples and normalize by the number of samples.
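As a concrete illustration, the generic recipe just described, draw samples, evaluate f, average, can be sketched in a few lines of Python. The function names here are mine, for illustration only; the example estimates E[x^2] under a standard normal, which is the variance, 1.

```python
import random

random.seed(0)

def monte_carlo_expectation(f, sampler, n):
    """Estimate E_p[f(x)] by averaging f over n samples drawn from p."""
    return sum(f(sampler()) for _ in range(n)) / n

# Example: E[x^2] under a standard normal is its variance, i.e. 1.
estimate = monte_carlo_expectation(lambda x: x * x,
                                   lambda: random.gauss(0.0, 1.0),
                                   100_000)
```

With 100,000 samples the estimate lands within a few thousandths of the true value, consistent with the one-over-square-root-of-N error behavior discussed next.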
This gives a random number known as the Monte Carlo estimator, which we showed last time to be an unbiased estimator whose variance drops at a rate proportional to the inverse number of samples. That means the expected error, the square root of the variance, drops like one over the square root of the number of samples. We saw that this is in some sense a good rate, because there is actually no better rate for unbiased estimators, and it also means that after a very small number of samples we get a relatively good ballpark estimate for the value of the integral we are trying to compute. It is not an efficient way to compute a precise estimate, though, because it takes an extremely large, quadratically growing, number of samples to drive the error below a small threshold. We also saw that, as nice as this idea is, there is a challenge: even though the procedure I have written down here is a relatively simple computation, it hides the computational issue that we need to draw random numbers from the distribution in the first place, and this drawing of random numbers can actually be quite challenging. For basic analytic distributions, we can do it using the definition of a random variable: draw from a uniform or base measure using a random number generator, and then push these random samples through a nonlinear function such that their derived density or measure is the one we are looking for. However, this requires us to know a lot, and to actually find the correct transformation. If we do not have that, we need to use a more generic algorithm. We spoke about this last time, so here is a quick recap: doing so can be hard for various reasons, and we found that we can fix a few of these issues.
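The "draw from a base measure and push through a transformation" idea looks like this in the one case that works in simple closed form, the inverse-CDF construction for an exponential distribution. This is a minimal sketch, not tied to any particular library:

```python
import random
import math

random.seed(1)

def sample_exponential(rate):
    """Inverse-CDF transform: push a uniform base sample through F^{-1}."""
    u = random.random()                 # u ~ Uniform(0, 1), the base measure
    return -math.log(1.0 - u) / rate    # F^{-1}(u) for an Exp(rate) variable

samples = [sample_exponential(2.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)      # should be close to 1/rate = 0.5
```

The catch, as noted above, is that finding such a transformation in closed form is only possible for distributions we understand analytically.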
One issue is that typically, for example in Bayesian inference, we do not have access to the probability density function we are trying to draw from, only to an unnormalized version of it: we may have prior and likelihood, but the evidence is unknown, because it is itself an intractable integral. That is something we might be able to fix by using an algorithm that can deal with this unnormalized form. But there is a larger problem, which is that doing so typically requires a global description of the target density. We saw some simple algorithms which can be used for low-dimensional, maybe one-dimensional, problems and which are reasonably efficient if you know a little bit about the target. For example, rejection sampling is based on the idea of constructing a proposal distribution that, scaled in the right way, puts an upper bound onto the pdf we are trying to draw from. Then we draw samples from it and reject all the ones that are not covered by the area under the target curve; the remaining samples are actual i.i.d. samples from the target distribution. But we saw that the rejection rate of this process can be quite bad even in the one-dimensional case; in the example, the rejection rate was about one half, so we lose roughly half of our samples. In multiple dimensions, the problem can be much harder: I gave you a simple example, taken from the literature, showing that even for extremely simple cases where we have a very good fit, in some sense, to the distribution we are trying to sample from, the rejection rate can be exponentially high in the dimensionality of the problem. At the very end of the lecture, we also saw that you might come up with, in some sense, naive ways of easing this problem.
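A minimal sketch of the rejection-sampling scheme just recapped, with a made-up one-dimensional example: the unnormalized target is x(1-x) on [0,1] (a Beta(2,2) density up to its constant), bounded by a scaled uniform envelope. All names are mine, for illustration:

```python
import random

random.seed(2)

def rejection_sample(p_tilde, q_sample, q_density, M, n):
    """Draw n samples from p ∝ p_tilde using an envelope M*q >= p_tilde."""
    samples, proposals = [], 0
    while len(samples) < n:
        x = q_sample()
        proposals += 1
        if random.random() < p_tilde(x) / (M * q_density(x)):  # under the curve?
            samples.append(x)
    return samples, proposals

# Target: unnormalized Beta(2,2), p_tilde(x) = x*(1-x), peak value 0.25.
samples, proposals = rejection_sample(
    p_tilde=lambda x: x * (1 - x),
    q_sample=random.random,        # Uniform(0, 1) proposal
    q_density=lambda x: 1.0,
    M=0.25,                        # tight upper bound on p_tilde / q
    n=50_000)
mean = sum(samples) / len(samples)        # Beta(2,2) has mean 0.5
accept_rate = len(samples) / proposals    # = Z / M = (1/6) / 0.25 = 2/3
```

Even with this tight envelope, a third of all proposals are thrown away; with a looser bound or in higher dimensions the waste grows quickly.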
One of them is importance sampling, which is based on the idea of not rejecting samples outright, but weighting every single sample with, essentially, its acceptance probability. That allows you to keep all of the samples around, so you do not have to throw any of them away. But it is essentially just a smooth version of rejection sampling and has all the same problems. In particular, it has maybe even an additional problem: because you do not reject samples, you do not really notice how little information individual samples provide. So I constructed this perhaps slightly overloaded plot showing that if you use a proposal distribution like this red curve, which has a very strongly suppressed probability density in regions where the target distribution has non-negligible density, then the importance-sampling weights can be extremely high in these regions, and such weights occur very rarely. So the weight distribution has a very high variance, but it can take quite a long time, or even infinitely long, for you to notice that this variance is high, because these extremely high-weight samples are drawn extremely rarely. So these methods, which are relatively old, are not a good way to construct general Monte Carlo algorithms, or to construct general samplers for high-dimensional distributions. What I want to do in today's lecture is give you a brief intro to some of the Monte Carlo methods you might call state of the art, the ones actually used in software packages currently employed in machine learning to do probabilistic inference. I should maybe say that we are doing this at a point in the course that is perhaps a little early to talk about such low-level computational issues.
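The weighting idea, in its self-normalized form, together with the effective sample size as a (fallible) diagnostic for the variance problem just described, can be sketched as follows. The target and proposals here are made up purely for illustration: the target is a standard normal, one proposal is a safely wide Gaussian, the other is badly mismatched:

```python
import random
import math

random.seed(3)

def gauss_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def importance_estimate(f, n, q_mu, q_sigma):
    """Self-normalized importance sampling: target N(0,1), proposal N(q_mu, q_sigma^2)."""
    xs = [random.gauss(q_mu, q_sigma) for _ in range(n)]
    ws = [math.exp(gauss_logpdf(x, 0.0, 1.0) - gauss_logpdf(x, q_mu, q_sigma))
          for x in xs]
    total = sum(ws)
    est = sum(w * f(x) for w, x in zip(ws, xs)) / total
    ess = total ** 2 / sum(w * w for w in ws)   # effective sample size
    return est, ess

good_est, good_ess = importance_estimate(lambda x: x, 20_000, 0.0, 1.5)  # wide proposal
bad_est, bad_ess = importance_estimate(lambda x: x, 20_000, 3.0, 0.3)    # mismatched
```

With the wide proposal, most of the 20,000 samples carry real information; with the mismatched one, a handful of rare, huge weights dominate and the effective sample size collapses, exactly the pathology described above, and the estimate itself can look deceptively stable until one of those rare samples finally appears.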
The reason I am doing it at this point is that, for you to do the exercises that go alongside this course, you need something in your toolbox that allows you to actually implement things. If we did not talk about any algorithms for the first few weeks, you would not have the tools at hand to solve the tasks on the exercise sheets. So I am going to be forced to use a few concepts that are not particularly hard, but which come maybe a little early in this course, and I will brush over them a little, not because they are unimportant or because I assume you know them, but because we will talk about them again as we move forward. Sometimes it can actually be helpful to use concepts in a more intuitive fashion early on, so that your brain already adapts to the use of these words; when they finally show up properly in the course, you get to understand them more precisely because you already know they are important. Okay, so the methods we are going to talk about address the issues we just identified with low-level Monte Carlo methods like rejection sampling, importance sampling, and what you might call exact Monte Carlo, meaning drawing from a base measure and then transforming through a nonlinear map so that we draw directly from the target distribution. Just to summarize: these methods are challenging to design because Monte Carlo methods are integration methods, and integration is hard, especially in high dimensions.
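To put a number on "hard in high dimensions": for a Gaussian rejection-sampling setup like the one mentioned in the recap, assuming a proposal that is only 10% wider than the target along every axis, the tightest envelope constant is M = sup p/q = 1.1^d, and the expected acceptance rate is 1/M, which decays exponentially with the dimension d:

```python
# Target: standard normal in d dimensions; proposal: normal with std 1.1 per axis.
# Tightest envelope: M = sup p/q = 1.1^d; expected acceptance rate = 1/M.
sigma_q = 1.1
accept_rate = {d: sigma_q ** (-d) for d in (1, 10, 100, 1000)}
# d=1: ~0.91, d=10: ~0.39, d=100: ~7e-5, d=1000: ~4e-42
```

So even an excellent per-dimension fit is useless in a thousand dimensions; this is the motivation for the local, chain-based methods introduced next.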
So practical Monte Carlo methods need to deal with the fact that typical distributions contain a normalization constant that we do not know: they have to be able to work with densities that are scaled by an unknown constant. And they have to address a more challenging issue, which is that all the methods we have spoken about so far are in some sense global. The proposal distributions in importance or rejection sampling have to be designed such that they are a good model for the distribution we are trying to draw from everywhere. They do not have to be perfect, because the algorithms take care of the mismatches by reweighting or by rejecting, but they still have to cover the entire measure. In particular, they have to put mass wherever the measure we want to draw from has mass; they should not put too little mass where our target measure has high mass, and not too much mass into regions where our target measure has no mass, because otherwise we will reject a lot of samples or have to wait a long time to get a good set of samples. The class of methods that addresses this issue, and which contains among its members the state of the art in sampling and Monte Carlo methods, is called Markov chain Monte Carlo, usually abbreviated as MCMC. It is based on a very intriguing idea, which can be a little tricky to understand at first: instead of constructing i.i.d. samples from the target distribution directly in one complicated computation, maybe we can get away with producing not i.i.d. samples but a sequence of conditionally dependent samples, such that over time, when we take these samples afterwards and maybe scramble them, we can actually think of them as i.i.d. samples. Here is how this works. First we have to define a concept that will also be important at later points in the lecture and for real applications; it is a very basic notion called a Markov chain.
This goes back to the Russian mathematician Andrey Markov, who was a contemporary of Andrey Kolmogorov; he wrote a text that was originally published only in Russian, and it took a while before the idea spread, with Kolmogorov helping to popularize it. A Markov chain captures a very specific kind of conditional independence, which is not quite independence but almost, and it corresponds to the following mathematical model: a sequence of random variables distributed jointly such that the conditional distribution for one variable in the chain, given all its precursors, is equal to the distribution for that variable given only its immediate precursor, p(x_t | x_1, ..., x_{t-1}) = p(x_t | x_{t-1}). What that means, and you can read this off from the graph if you watched lecture two, is that every single variable is conditionally independent of all earlier variables given its immediate predecessor. The natural intuition implied by this picture is that of a time series, a memoryless process that only knows where it currently is; it does not care about where it came from and just moves forward. Markov chain Monte Carlo methods, Monte Carlo methods based on this structure of a Markov chain, iteratively draw samples so as to produce a kind of dynamic process that moves around the support of our probability measure, such that over time the elements of this chain, if you take them all and scramble them randomly, are, at least asymptotically, independent draws from the target measure.
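The idea that a memoryless chain forgets its starting point and settles into a long-run distribution can be illustrated with a tiny discrete example. The three-state transition matrix below is arbitrary, chosen only for illustration; iterating it shows the chain's distribution converging to a stationary distribution, which is exactly the property MCMC methods will exploit:

```python
# Hypothetical 3-state chain; rows are "from"-states, columns are "to"-states.
T = [[0.80, 0.15, 0.05],
     [0.20, 0.60, 0.20],
     [0.25, 0.25, 0.50]]

def step(dist, T):
    """One step of the chain: new_j = sum_i dist_i * T[i][j]."""
    return [sum(dist[i] * T[i][j] for i in range(len(dist)))
            for j in range(len(T[0]))]

dist = [1.0, 0.0, 0.0]      # start deterministically in state 0
for _ in range(200):        # iterate; the chain forgets where it started
    dist = step(dist, T)

stationary = dist           # (close to) the stationary pi with pi = pi T
```

Applying one more step leaves `stationary` unchanged to numerical precision; this invariance is the discrete analogue of the stationary-distribution argument made for Metropolis-Hastings later in the lecture.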
How this actually works, and why it is the case, we now have to see by constructing one of these algorithms. We will start with the most basic one, maybe the original MCMC method, then find problems with it and fix them iteratively, from one step to the next, and by the end of this lecture we will reach a relatively complicated algorithm that is arguably state of the art, or at least very close to it. It is not the only state-of-the-art algorithm; there are several such Markov chain Monte Carlo methods currently in use, and the picture of which of them is best for which particular setting is constantly changing as I record this lecture. But we will reach at least one algorithm that you can think of as very close to the state of the art, even though it is already a single-digit number of years old. Okay, but let's begin with a really old algorithm, one that goes back about half a century, again to the nuclear physicists working on the Manhattan Project during the end of the Second World War and the Cold War. It is the algorithm on the next slide, called Metropolis-Hastings, and it is based on the following intuition. Let's first think about a setting in which we do not want to draw from the distribution; we just want to find the maximum of our probability density function. That is an optimization problem, and optimization is actually easier than integration, because we are just trying to find one point rather than a distribution of points. We could do that using the following stochastic optimization algorithm. It is not a great optimization method, and I do not recommend that you build your optimizer this way, but it helps build intuition for what we are trying to do. So let's say we are trying to build our Markov chain of random numbers, and let's say we want this Markov chain to
converge to the maximum, or at least a local maximum, of the target distribution p. We have a current estimate x_t, and we are going to use something we call a proposal distribution, which I will call q. q is not the thing we are trying to draw from; we are trying to draw from p, or rather its unnormalized version p tilde. The proposal distribution can be, well, more or less anything; let's imagine it is a distribution with at least a local smoothness property. Whenever you see a term in the equations I am going to introduce that contains q, typically in a numerator and a denominator, just assume that this quantity exists; that is the level of regularity we need from q. Here is what we do. We are currently at some state x_t of our Markov chain, and we use the proposal distribution to produce a random number x' distributed according to q. q might depend on where we currently are; it might, for example, be a distribution centered at x_t, and it produces a sample. Then we evaluate the ratio between the value of our target density at the proposed point and at our current point. If that number is larger than one, it means that p tilde at the proposed point has a higher value than where we currently are; since we are currently trying to optimize, that is a good thing, so we just step there. We call this accepting the point, and we add it to our Markov chain. If that is not the case, that is, if the value of p tilde at the point we have just tested is below the value at the point we are currently at, then we do not move anywhere; we take a step in the Markov chain in time, but keep the chain where it currently is, adding the current location as the next element, so x_{t+1} = x_t. Okay, so this process will evidently, if only asymptotically, get us to a point that is a
local maximum, assuming that q is sort of smooth and has sufficient support around all the states we pass through; we are not going to show that, but I guess it is intuitive assuming q is sufficiently regular. Now, we do not want an algorithm that only moves to the maximum and then gets stuck there; we want an algorithm that asymptotically produces draws that actually move around the entire space. We do, however, want something similar to optimization, namely that regions of high probability density should be visited more often: it should be more likely to step to a point of higher density than to a point of lower density. We are going to do that using the following rule, which is the Metropolis-Hastings method. Who actually came up with this method is a little unclear, and that probably has something to do with the fact that it was invented somewhere in the Manhattan Project in Los Alamos, in an environment shrouded in secrecy, so the actual inventors may not have been able or willing to write a paper about it; this is why the original authorship is a little complicated. Even though the method is called Metropolis-Hastings, it might actually be due to the Rosenbluths and the Tellers, who at least wrote a paper about it, together with Metropolis, in 1953. It does not matter much for our purposes, though. The way Metropolis-Hastings works is as follows. We draw, as before, a random number from the proposal distribution; for example, the proposal distribution might be a Gaussian centered at the current location. I will show you a picture of what this looks like in a moment, which should give you an intuitive view. Then we again evaluate this ratio between the probability at
the point the proposal distribution says we might want to step to, and the density where we are currently at. And additionally we multiply it with the ratio between two quantities that are a little complicated to say out loud: in the denominator, the proposal density for the proposed point given the current location, q(x' | x_t); and in the numerator, the reversed proposal density for the current point given the proposed point, q(x_t | x'). One way to think about this: the term in the denominator is the probability to step to where we want to step, given that we are currently here; the term in the numerator is the probability that, if we were at the point we are thinking about moving to, we would step back to where we currently are. Why this particular ratio, we will see in a moment; for now, let's just say it is there. The algorithm is now different from before: we do not optimize and accept only if this ratio is larger than one. Instead, we always accept if the ratio is larger than one, so every time the step goes upwards we accept it; but if the ratio is less than one, we accept it with a probability given by that ratio. Note that this is a number lower bounded by zero, because all of these are probability densities, that is, numbers larger than or equal to zero; and we assume that the terms in the denominator are not zero. Why can we assume that?
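We will come back to that question in a moment. First, the complete loop described so far, propose from q, compute the ratio, accept or copy, can be written down very compactly. Here is a minimal sketch (my own naming, not from any particular package), using a Gaussian proposal and, as a stand-in target, a standard normal known only up to its normalization constant; everything is done in log space for numerical stability:

```python
import random
import math

random.seed(4)

def log_q(x_to, x_from, step):
    """Log density of the Gaussian proposal q(x_to | x_from)."""
    return -0.5 * ((x_to - x_from) / step) ** 2 - math.log(step * math.sqrt(2 * math.pi))

def metropolis_hastings(log_p_tilde, x0, step, n_steps):
    """Accept with probability min(1, p~(x') q(x|x') / (p~(x) q(x'|x)))."""
    x, chain = x0, []
    for _ in range(n_steps):
        x_prime = random.gauss(x, step)                      # propose from q(.|x)
        log_a = (log_p_tilde(x_prime) + log_q(x, x_prime, step)
                 - log_p_tilde(x) - log_q(x_prime, x, step))
        if random.random() < math.exp(min(0.0, log_a)):      # accept the move...
            x = x_prime
        chain.append(x)                                      # ...or copy x again
    return chain

# Unnormalized target: standard normal up to a constant, p_tilde = exp(-x^2/2).
chain = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, step=1.0,
                            n_steps=200_000)
burned = chain[10_000:]                                      # discard burn-in
mean = sum(burned) / len(burned)                             # should be near 0
var = sum((v - mean) ** 2 for v in burned) / len(burned)     # should be near 1
```

With a symmetric proposal like this Gaussian, the two q terms cancel exactly; they are kept in the code only to match the general formula.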
Because one of them is nonzero just by design, and the other, sort of by design as well: we have just drawn that number from q, so we would not have drawn it if its proposal probability were zero; and p tilde at the current point will not be zero if you design your Markov chain sensibly, because unless we start at a point of zero probability, we should never move to a point of zero probability. Okay, I will show you a picture of how this works in a moment, but before we do that, two quick things to keep in mind. One interesting aspect here is that there is no rejection in the sense of rejection sampling, where a point is simply thrown away; a proposal might be rejected, but the Markov chain never stops, there is always a time step from t to t+1. What we might do instead is decide to copy our current location into the Markov chain: if this ratio is less than one, and we now draw a uniform random number, and that uniform number is larger than the acceptance probability, then the proposed point x' is not accepted. What that means is that the Markov chain steps forward by one step and keeps a copy of where we currently were as the next time step. We then have the same point in our Markov chain twice after this step, and in the end, when we take all these samples and scramble them, this point will show up more frequently. This will be necessary for the proof to work; there is going to be a simple argument for why this is the right thing to do. So how does this actually work in practice? I am going to show you two different visualizations to give you a pictorial view of this algorithm. I know that the next visualization, which is designed to be very specific, sometimes confuses people, so I am going to go slowly. Let's say you want to use this Metropolis-Hastings algorithm to draw from the black curve you see here, this wiggly distribution; that is the thing we want to draw from. Of
course, we do not really care about sampling in 1D, but 1D plots are easier to read. The red curve is our proposal distribution; we assume here that we use a Gaussian, and as I already told you in a previous lecture, a Gaussian distribution is this bell-shaped curve. Now, one fun thing about Gaussian distributions is that they are symmetric around the mean, which means, very specifically, that the probability to move to this point given that we are currently at that point is equal to the probability to move to that point given that we are currently at this point. This means that the ratio of proposal densities is just 1, so those terms cancel and we are left with the two p tilde terms. That is not necessary for the algorithm to work, but it is nice because it simplifies that part. Okay, so what just happened here is that we are currently at this point, our current location, at about 3, and the proposal distribution has proposed to move to this location, at about 1.1 or so. What we do now is evaluate the ratio between this number, p tilde at x', and this number, p tilde at x_t; and the way I show this is by drawing a horizontal line connecting these points and then a vertical bar from the lower point to the upper one. You can see that this ratio is larger than 1, because this value is higher than that value. Great, so we just accept this point; the rule is that we move there. If we do this, we move to this point, and what I do here underneath is create little black dots for all samples that are entered into the Markov chain, drawn at a height chosen uniformly between 0 and the height of the black curve at that point. Why am I doing that?
Because that means that if the Markov chain works well, and asymptotically produces samples that actually come from the true density, then if we run this process for a long time we expect these black dots to be uniformly distributed in the area bounded by the black curve. Okay, so what happens next? Our proposal distribution, now centered here, has produced a new proposal, which is to move to this point here. Sorry, to this point, right? And that has very low density. So the probability for this point to be accepted, which is the ratio between this number and that number, and which you can read off from this vertical bar here, this little bit down here, is very small. Now I produce a random number, the golden dot, that is uniformly distributed on this vertical region, and it lies above the threshold, so this point will not be accepted. What we do instead is add one more black dot at the current location, again at a uniform height between zero and the height of the curve; we just copy the current point over. By the way, notice that this process requires us, at every time step, to evaluate the value of the black curve, so this is not for free: evaluating p is something that takes computational work. For this very simple p it is trivial work, but for a complicated simulation model it might be expensive. Now, having decided to stay at this point, we move forward in time; our proposal distribution produces a new random sample, over here. We take the ratio between this value and that value, given by where the horizontal bar sits on the vertical bar, we draw a random number, it is below the bar, so we accept and move to this point. By now you may have an idea of how this works;
we will keep doing this for longer and longer, and over time we see here the samples produced after, I think, something like 300 steps; yes, it says up here, about 300 steps. You might look at this picture, and here is one of these points where I would normally ask you a question: what do you notice about this plot? If you do not see it immediately, maybe stop the video here. Once you stare at this picture for a while, you will see that the distribution of black dots under this curve does not look entirely uniform. This is because after 300 steps this Markov chain has not yet sufficiently mixed, as we say; we have not actually drawn i.i.d. variables distributed according to the black curve, which would make the black dots uniformly distributed across the area delineated by the black curve. If you stare at this picture for longer, you might get an intuition for why the dots are distributed the way they are, due to how the Markov chain moves around. You can, for example, see that there is a bit of a massing of dots here, and maybe over here as well, and I will let you figure out why this Markov chain behaves that way. So let's move forward; I will show you one more visualization of how Metropolis-Hastings works, in a two-dimensional setting. For this I am going to use a visualization that was not created by me; it was created by Chi Feng, who, at least according to the last record of him I can currently find online, was a student at MIT. His code is available on GitHub; you can find the link here. It is a really great visualization of how Metropolis-Hastings works, which is why I am showing it to you even though I did not write it myself. He does not have a copyright statement that I can find; I hope he is fine with me showing this. I very much recommend
this GitHub page; there is lots of cool stuff on there. It is a great visualization, and we should use good didactic tools when we have them. So let's do this; here is what it looks like. I am going to show you this plot over and over again with different algorithms over the course of the lecture. What you are seeing here is a two-dimensional probability distribution; it is a banana-shaped distribution. This nice visualization actually offers different distributions; I am going to stick with this one. On the left and on the side you see marginals of the distribution, projections of it onto the walls; this is what they look like, and over time a histogram of the samples will appear on top of them. We are going to use the Metropolis-Hastings method here, so I will just start the visualization. What you see is that the algorithm initially starts with a sample over here, and then it uses the proposal distribution, which is a Gaussian. Every time you see one of these arrows showing up, the algorithm draws a sample and evaluates the Metropolis-Hastings ratio. If the arrow turns red, it means that the sample is not accepted, and then the algorithm makes more and more copies of the current point; you can see, when it stays at one point, how the corresponding bars of the histograms up here rise, because these copies keep being added to the sample set. Every time the arrow turns green, the acceptance probability was met by the uniform random number, and the algorithm actually moves; you can see it moving forward. Over time, and I can actually speed up the simulation a little, you now see this algorithm move around, populating the distribution as it goes. You can see that this is in some sense an interesting process; it also has maybe a few flaws that we will have to talk about. But before we talk about these flaws, let me first show you how
people show that this is actually a good algorithm to use; for that I have to stop this simulation, reset it, and move on. How do people show that this is an acceptable method to draw random numbers? The standard mathematical machinery is to show two different properties of the Markov chain, called detailed balance and ergodicity. These are actually not necessary requirements for the method to work, but they are sufficient, and they are typically what people use to show that a certain Markov chain is a good idea. What we want to show is a statement about the distribution of the samples from this Markov chain: assuming the chain fulfills the ergodicity requirements, and we start at an arbitrary point in the support of the target distribution, then the distribution of x_t, the asymptotic density of the samples from this Markov chain, approaches the target distribution in the limit. How do people show that? Well, they show the two properties I mentioned, detailed balance and ergodicity, and if we can establish these properties, then Metropolis-Hastings fulfills the claim. Detailed balance is the property given by this equation, p(x) T(x' | x) = p(x') T(x | x'), where capital T is the transition probability, for the entire algorithm, not just the proposal distribution, to move from a current location x to a new location x'. Detailed balance means that the probability to be at one point under the distribution p and then move to the other point under the transition probability is equal to the probability to be at the other point and move from there back to the current location. Let's not yet think about why this is the right thing to show; let's first show that Metropolis-Hastings fulfills this property. How do we show this? Well, we just plug in what T actually is: T is
the probability for the entire algorithm to make that step. For that to happen, the proposal distribution has to propose the location, and then the location has to be accepted. The probability to accept the point is either one, if the Metropolis-Hastings ratio is larger than one, or that ratio itself; so it is the minimum of one and the ratio, which lies between zero and one. Now, because this is a minimum, and the location of a minimum is not changed by a monotonic transformation, here a multiplication with a positive number, we can drag this number inside of the minimum. That means that here we just get this number, and here, notice that these are the same terms, so these two cancel with each other; now we have this on the right-hand side and this on the left-hand side of the minimum. Then we do the opposite thing: we drag that number out, take it outside of the minimum, which again we are allowed to do because it is a number larger than zero. And now we have exactly the probability we are looking for: here we have an x', and this, because the minimum is symmetric in its arguments, is the probability for the algorithm, given that we are at x', to propose x and then accept it. Markov chains that have this property called detailed balance have at least one stationary distribution, which means that asymptotically, if we generate many of these samples and look at their density, that density is given by this so-called stationary distribution. This is not really a proof, just a sketch of how it works; to do it properly you would of course have to make much more careful regularity assumptions, but this is not a technical mathematics lecture, so I am just going to show you roughly how the proof works. Look at this: this is what you would get if you assume that the algorithm produces samples that come from
Here is roughly what you get if you assume the algorithm produces samples that come from this distribution p. Take all of these samples, and for each sample apply the transition probability from x to some x', then integrate out all the points x that the algorithm might be at. So assume you have a bunch of samples from p and apply what the algorithm does in a single step; that is a kind of convolution, if you like, with the transition kernel. Because of detailed balance, we can rewrite the integrand p(x) T(x' | x) as p(x') T(x | x'). Notice that the integral is over x, so p(x') does not depend on it and can be taken outside of the integral; and what remains inside is a probability distribution over x, the probability to move to x given that we are at x', which integrates to one. So we are left with p(x'). What this means, colloquially speaking, is that if the algorithm at some point produces samples distributed according to p, and you then apply one step of the algorithm, the samples afterwards still come from p. That's nice, but of course it does not yet guarantee that the algorithm will actually reach this distribution. It only says that if the chain is once in the stationary distribution, it stays in that distribution, and that there is at least one such stationary distribution.

To show that the algorithm actually reaches it, we require the second property, which is called ergodicity. Ergodicity is an even more complicated concept to define; roughly, leaving out a bunch of technical conditions, it amounts to the following statement about the sequence produced by the Markov chain. Such a sequence is called ergodic if it is aperiodic, meaning it does not get trapped in a recurring cycle of states, and if it has what is called positive recurrence, which means that if the chain is at some point at x*, then there is a time t' > t such that the probability to return to x* is larger than zero. So the chain has a non-trivial, non-zero probability of returning to where it was, but the probability that this repeats over and over in a fixed cycle is not one. Metropolis-Hastings with such a proposal is ergodic; this is not a proof, but the intuition is that because we are using random numbers to decide each step, if we return to a point we have been at before, the probability of exactly restarting the earlier cycle is zero. Ergodic Markov chains have at most one stationary distribution; I am just stating this, the proof is actually quite tricky. Metropolis-Hastings therefore has at least one stationary distribution and at most one stationary distribution, and therefore it actually approaches that stationary distribution over time. So this means: if you use a proposal distribution that is non-zero on the entire support of p, then for any starting point x0 in this domain, the density of the samples over extremely long runs of this chain approaches p. In that sense we are now allowed to use this method, because over time it reaches the distribution we are looking for. It does not say, though, that this algorithm converges at the rate we introduced in the last lecture for exact Monte Carlo. What I mean by that is better and more easily shown in two visualizations; for the first one I will return to this beautiful interactive code by Chi Feng.
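Before the visualization, the whole pipeline can be sketched in a few lines. This is a minimal random-walk Metropolis-Hastings sampler on an assumed toy target, a standard normal; note that only the unnormalized log-density is ever evaluated, and that the chain forgets its (deliberately bad) starting point.

```python
import math
import random

# Minimal random-walk Metropolis-Hastings targeting a standard normal
# (toy example; only the unnormalized density log p_tilde is needed).
random.seed(1)
log_p_tilde = lambda x: -0.5 * x * x       # log of the unnormalized target

x = 3.0                                    # start deliberately away from the mode
samples = []
for _ in range(50000):
    x_prime = x + random.gauss(0.0, 1.0)   # symmetric Gaussian proposal
    if math.log(random.random()) < log_p_tilde(x_prime) - log_p_tilde(x):
        x = x_prime                        # accept; otherwise keep the old x
    samples.append(x)

tail = samples[5000:]                      # discard burn-in
mean = sum(tail) / len(tail)
var = sum((s - mean) ** 2 for s in tail) / len(tail)
print(round(mean, 2), round(var, 2))       # close to the target's moments, 0 and 1
```

The empirical mean and variance approach 0 and 1, the moments of the target, exactly as the stationarity argument promises; how fast they do so is the separate question of mixing that the visualizations address.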
Let me run this again. One thing you can do here is change the width of the proposal distribution. I have so far glossed over the fact that we need to come up with this proposal distribution, which here is a Gaussian, and a Gaussian has a parameter: the width of the distribution. How do I set it? Let's say we make it very small, like this. Now the proposal distribution is very narrow, and what you see is that almost every proposed location is accepted. The reason is that with a very small proposal distribution, the target distribution is locally almost constant; it is well approximated by a constant because it is a smooth distribution, so the point is accepted. But you can also see that the Markov chain is now moving relatively slowly: it is performing what is known as a random walk. Random walks have, as some of you may know, the annoying property that the rate at which they move away from any particular point is proportional only to the square root of the number of samples, rather than linear in the number of samples. So maybe you could say one way to fix that is to make the proposal distribution much broader. But if I do that, the algorithm will typically get stuck for very long times: almost every sample we propose now falls outside the main support of the distribution and therefore is not accepted, so the algorithm keeps adding more and more points to one single location before it moves. This means that even though each accepted move is now a larger step, the time between moves is much longer. So how should we choose the step size for this algorithm? One intuition is that, locally, this distribution has two different length scales, if you like. It has a long length scale, which is basically the width of this picture, call it capital L, and it has a short length scale, which is roughly the width of this banana, call it epsilon. How the algorithm moves depends on both epsilon and L. To get a good acceptance rate, we have to set the width of the proposal distribution to something like epsilon, roughly like this; then the algorithm accepts a high proportion of locations, but it goes into something similar to the random walk, as you can currently see.

To understand what that means for our algorithm, I am going to stop this, move forward one slide, and show you this picture again. Here is a cartoon of what happens. This is a Gaussian distribution; the plot is one I made myself, but it is inspired by a similar plot in David MacKay's book. A Gaussian is actually very easy to sample from, but that makes it useful for diagnostics. It is very strongly correlated, so it is very thin. The black dots are actual draws from this distribution, which are possible to produce in closed form for this simple case; the red dots are Metropolis-Hastings samples. The chain has started up here, then moved down here and begun this random walk. As I just said, this is quite a typical situation in practice, because in real-world inference problems you of course do not know much about the distribution you are trying to draw from. At best you have a very rough idea of what the width of the distribution is, you use a spherical proposal distribution, a Gaussian, because you do not know much else, and the true distribution has some non-trivial shape anyway. Say the problem has d dimensions, and there is a large length scale L and a small length scale epsilon.
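The trade-off just described is easy to reproduce. Here is a sketch (an assumed toy experiment on a standard normal target, not the banana from the demo) measuring the acceptance rate of random-walk Metropolis-Hastings as a function of the proposal width:

```python
import math
import random

# Acceptance rate of random-walk Metropolis-Hastings on a standard normal,
# as a function of the proposal width (toy version of the step-size trade-off).
random.seed(2)
log_p = lambda x: -0.5 * x * x

def acceptance_rate(width, n=20000):
    x, accepted = 0.0, 0
    for _ in range(n):
        x_prime = x + random.gauss(0.0, width)
        if math.log(random.random()) < log_p(x_prime) - log_p(x):
            x, accepted = x_prime, accepted + 1
    return accepted / n

# Tiny steps are almost always accepted (but the chain random-walks);
# huge steps are almost always rejected (the chain gets stuck).
for width in [0.1, 1.0, 10.0, 100.0]:
    print(width, acceptance_rate(width))
```

The acceptance rate decreases monotonically with the width: near one for tiny steps, near zero for huge ones, which is exactly why neither extreme mixes well.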
In this case, L is this long line from here to here, which is about one, and the small length scale epsilon is here about 0.1. Now, to get a reasonable acceptance rate, you have to set the width of the proposal distribution roughly to the small length scale; otherwise almost everything is going to be rejected, because, remember, in high-dimensional spaces we have this problem of high rejection rates, and otherwise the acceptance rate is exponentially suppressed in the dimension d. We have not even seen much of this in our two-dimensional pictures so far. So if we take the width of the proposal distribution to be epsilon, the acceptance rate will be on the order of one, which is okay, but the Metropolis-Hastings algorithm will then do the random walk I just showed you. Random walks have the property that their expected squared distance travelled, the variance, grows linearly with the number of steps, because they are essentially sums of independent increments, much like the Monte Carlo estimator itself. So the variance is epsilon squared times the number of accepted steps, which is the acceptance rate a times the number of steps t, because steps that are not accepted do not move the chain. The expected distance travelled is the square root of that: epsilon times sqrt(a * t). We want the algorithm to travel across the whole distribution, a distance of capital L, to get one effectively independent sample. So, intuitively, the time it takes to get one independent sample, for the Markov chain to move back and forth across this distribution, comes from setting L = epsilon * sqrt(a * t) and rearranging: t = (1/a) * (L / epsilon)^2. That is one over the acceptance rate, which is maybe okay because we have chosen the width so that a is on the order of one, times the ratio between the large and the small length scale, squared. For a Gaussian, this ratio is essentially the condition number of the covariance matrix, for those of you who know this concept already, the matrix that defines the equipotential lines of this quadratic function; and it enters squared, which is bad, because in real life this ratio may already be a large number, and we take the square of it, which makes it an even larger number. So this algorithm is kind of smart, and it is half a century old, so it is pretty cool that it has these properties, and it is very general, you can apply it to nearly any distribution; but it has this flaw, and we are going to have to fix it over the course of the rest of this lecture.

Here is another picture: the experiment I showed you in the previous lecture, trying to estimate pi. I have shown you the black line before; these are exact samples from the true distribution, and we know that this Monte Carlo estimate converges like one over the square root of the number of samples. Here is that plot again, in black, with the one-over-square-root convergence rate. Metropolis-Hastings is going to be worse than that: its convergence rate is lower-bounded by this good behaviour of the black line. At the very best, the algorithm draws exact samples; in reality it will not. It produces effectively independent samples at a rate given by one over the acceptance rate times the length-scale ratio squared. For this toy problem of estimating pi you can see that even if you just want a ballpark figure, say an error less than one, roughly, you might have to wait something like two orders of magnitude longer, something like a hundred times longer, for this algorithm to get to that point than you would have to wait, which is already a long time,
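The square-root growth of a random walk, the key ingredient in the (L/epsilon)^2 scaling above, can be verified empirically. This sketch (an assumed toy simulation) compares the root-mean-square distance travelled after t steps against the prediction epsilon * sqrt(t):

```python
import math
import random

# The RMS distance travelled by a random walk grows like sqrt(t), not t,
# which is why a small-step chain needs ~ (L/eps)**2 steps to cross a scale L.
random.seed(3)
eps, n_walks, n_steps = 0.1, 500, 6400
finals = {100: [], 1600: [], 6400: []}     # snapshot times
for _ in range(n_walks):
    pos = 0.0
    for t in range(1, n_steps + 1):
        pos += random.gauss(0.0, eps)      # one accepted step of size ~ eps
        if t in finals:
            finals[t].append(pos)

for t in sorted(finals):
    xs = finals[t]
    rms = math.sqrt(sum(x * x for x in xs) / len(xs))
    print(t, round(rms, 3), round(eps * math.sqrt(t), 3))  # empirical vs eps*sqrt(t)
```

Quadrupling the number of steps only doubles the distance covered, so reaching a scale L = 1 with steps of size epsilon = 0.1 costs on the order of (1 / 0.1)^2 = 100 steps, matching the mixing-time argument.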
for the exact sampler. And this is just a two-dimensional example; in higher dimensions the issue can be much, much worse. So, to our long-awaited first summary slide: Monte Carlo methods in practice are usually Markov chain Monte Carlo methods. Markov chain Monte Carlo methods are able to deal with complicated distributions whose PDFs are only known up to normalization, even if you do not have a global representation of the distribution, by using a local proposal distribution that only requires local operations: it requires us to evaluate p, or p tilde, in the vicinity of the current point, and it uses a Markov chain which over time asymptotically approaches the true distribution. These methods converge at best as fast as exact sampling, and in the naive form I just showed you they may additionally have a relatively bad mixing time, meaning the number of steps we have to take to actually get that kind of convergence rate may be very high, especially in high-dimensional settings. To address this, we will have to build better methods. Here you can take a quick break to think about this problem, and then we will return.

Alright. In the meantime I have noticed that the camera was recording from the wrong microphone, so the audio quality should now be a little better; apologies for that. Our first Markov chain Monte Carlo method was Metropolis-Hastings. It is a half-century-old method that is nice in the sense that it is generic: we can apply it to relatively general distributions, and we only need to be able to evaluate the target distribution up to normalization, so only p tilde, in a local region around where we currently are. We have also noticed that there is this mixing problem, which we need to address: on the one hand we have to choose the proposal distribution such that the acceptance rate is relatively high, but when we do so, the algorithm tends to go into a diffusion. One way to address this issue to some degree, one that in certain
applications is still a very efficient algorithm, is known as Gibbs sampling. It is given by the following abstract form, which is a bit complicated to describe, so I will show you a visualization in a moment, once I have introduced how the algorithm works. The idea is this: if we are drawing from a multivariate distribution, one where x has multiple indices, which is usually the case, then there are often situations, arising in particular applications of which we will see some later in the course, in which the conditional distribution for the i-th dimension, the i-th entry of the variable (and that entry need not be a scalar, it could be a whole subset of the entries of x at time t), given all the other entries of the variable, is tractable. This means we are currently at some location in the input space, and if we reason about only one, or a subset, of the dimensions of x at this location, so basically we cut through the distribution at the point we are currently at, then that conditional distribution is often one that can be captured analytically and from which we can draw efficiently. You can already imagine when that would be the case: either because that variable becomes conditionally independent of some other subset of the variables when we condition in this way, or, in the most basic case, because the i-th element is just a univariate distribution, and in univariate problems we can use simple things like rejection sampling to build reasonably efficient samplers. Gibbs sampling works by doing this repeatedly: we sample the elements dimension-wise, one after the other, so we take axis-aligned steps in the coordinate system to draw our samples. I am going to write this in notation using this delta, which is known as the Dirac distribution. It is a rather awkward kind of object conceptually, a probability measure that does not have a density, but it is really just a notational tool that is very handy to describe what we are going to do: it expresses that we are keeping certain values constant. So the probability to draw from the proposal distribution a new location x' given x_t is: we keep all the variables other than i constant (I write this with a backslash, x_{\i}, meaning everything other than i), so the entries of x_{t+1} without i are just the ones from x_t, multiplied by the probability of drawing just the i-th element of the next step given the values of all the other elements. If this is confusing, I will show you a picture in a moment. The target probability we need to evaluate at this x' is then just the probability of the i-th element given all the other ones, times the marginal over all the other ones. That is just the product rule, a fundamental property of probability distributions, which holds in particular at location t.

Now, if we use this as the proposal distribution for the Metropolis-Hastings algorithm, we find that it has acceptance rate one, so the process always takes the step. Why is this the case? You can see it simply by looking at the form of the Metropolis-Hastings acceptance probability; here it is again, just copied over from the previous slides. Our proposal distribution is this expression up here, and for the target distribution we use the fact that it can be written as the conditional of the one entry of x_t we care about, number i, given all the other ones, times their marginal; this works, just to remind you, for every i, because of the product rule. For the proposal distribution q(x' | x_t) we use this form with the Dirac notation, the notational trick that just means we keep all the other entries constant and propose to draw only the one entry x_i. Now notice that a bunch of terms cancel: this term is in the numerator and the denominator, so it cancels, and this term is in the numerator and the denominator, so it cancels too. We are left with this expression, and if you think about it for a moment: it is the probability of the i-th element given that all the other entries are those of x_t, which we keep constant. That is exactly the proposal distribution for the reverse direction, the one that shows up up here in the acceptance probability: the probability, being at x' with all the other entries equal to those of x_t, of proposing the step back to x_{t,i}. So that ratio is just one, and the algorithm is always going to accept. This is how the idea of Gibbs sampling gets rid of the problem that we need to find a proposal distribution scaled the right way for the target distribution so as not to get a very low acceptance rate. One way to think about this process is that it constructs a perfectly scaled proposal distribution, which has acceptance probability one. The only price we pay is that we now need to do the updates axis-aligned rather than doing them jointly over all the variables at once. What that means I will show you again in the visualization by Chi Feng. Here is the code again; I have switched the sampler to Gibbs sampling, and it is the same distribution as before. If I keep running it, what you see is the algorithm taking an entire Gibbs sweep each time: it proposes the two variables in turn, first taking a step along one axis and then proposing a step along the other axis.
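A minimal sketch of such a Gibbs sweep, on an assumed toy target (a standard bivariate Gaussian with correlation rho, whose conditionals are known in closed form: x_i | x_j ~ N(rho * x_j, 1 - rho^2)):

```python
import math
import random

# Gibbs sampling for a correlated 2-d Gaussian: each axis-aligned draw is an
# exact sample from a conditional, so in the MH view every step is accepted.
random.seed(4)
rho = 0.8
cond_std = math.sqrt(1.0 - rho * rho)      # std of x_i | x_j
x1, x2 = 0.0, 0.0
xs1, xs2 = [], []
for _ in range(30000):
    x1 = random.gauss(rho * x2, cond_std)  # sweep: draw x1 | x2 ...
    x2 = random.gauss(rho * x1, cond_std)  # ... then x2 | x1
    xs1.append(x1)
    xs2.append(x2)

xs1, xs2 = xs1[2000:], xs2[2000:]          # discard burn-in
m1 = sum(xs1) / len(xs1)
m2 = sum(xs2) / len(xs2)
v1 = sum((a - m1) ** 2 for a in xs1) / len(xs1)
v2 = sum((b - m2) ** 2 for b in xs2) / len(xs2)
cov = sum((a - m1) * (b - m2) for a, b in zip(xs1, xs2)) / len(xs1)
print(round(cov / math.sqrt(v1 * v2), 2))  # empirical correlation, close to rho
```

There is no accept/reject branch anywhere: every draw moves the chain, and the empirical correlation recovers rho. The cost of the axis-aligned moves only shows up in how fast the chain mixes, which is what the visualization illustrates next.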
You can see that these are always axis-aligned steps, and there is always a green arrow, because the steps are always accepted: the acceptance rate is one. You can also see that this is a relatively smart algorithm; it is exploring comparatively quickly. The nice thing is that it always has the entire dimension to choose from, because it takes an exact draw along each axis. It is still a Markov chain Monte Carlo method, though: there is still a Markov chain that has to mix, and you can see why this is a problem now that the algorithm has reached these tails. First, it takes quite a long time to get to the tails, because if you are up here, then a draw along this direction has almost zero probability of moving very far down, and the subsequent draw along the other axis cannot fix that, so the algorithm has to slowly diffuse its way down. And once it is down here, it is stuck the other way around: it has to sit in one of these two tails, and within a tail it cannot move far in the other direction. You can see that this algorithm nevertheless mixes much better; after a bunch of samples (this run is not quite there yet, I would need to run it for longer) it is actually not that bad. But there are of course also settings in which this method is particularly inefficient. If you like, stop the video here for a moment and try to think of a situation in which this method is not going to work particularly well. In case you have stopped and thought about it: when is it not going to work well? It does not work well if the distribution we are trying to sample from is really not axis-aligned. If it looks something like this, an elongated, correlated thing, then the algorithm takes very bad local steps, and one way to get Gibbs sampling to work well is to make sure you set up the sampler such that these conditional proposal distributions are actually relatively well axis-aligned.

So let me stop this (oops, it is not stopping) and go back, and tell you that this is one efficient way to build a good algorithm. It is not always great: for a distribution like this one it is not an ideal algorithm and can actually work rather badly. But for models that have this conditional independence structure, or structure under which these conditional proposal distributions are good, it can actually work really well, and later on in the course, when we talk about more structured probabilistic models, we will encounter situations where this method suggests itself because of the structure of the model.

Now, for the remainder of this lecture, I want to try to get us as close as possible to the actual state of the art in this part of probabilistic machine learning. To do so, fair warning: I am going to deliberately introduce a few things that, at least at first sight, are probably going to seem a little over the top in terms of their mathematical depth. This is a deliberate step. I know that some of the stuff I am going to show you will seem too fast and maybe too deep; I am still doing it, because one of the goals of this course is to empower you not just to use standard techniques but to develop your own algorithms. You are master's students of machine learning; you are supposed to become experts, and an expert has to be able to build their own solutions rather than just use existing tools. To do that, you have to understand how the existing tools work; you cannot always use a toolbox without understanding how its individual elements work. That does not mean you have to get every little detail of the derivation; you just have to get an intuition for what is actually going on. We are going to talk about an algorithm that is clearly among the state of the art, or at least the basis for the state of the art,
in parts of probabilistic machine learning: a Monte Carlo method that, like Gibbs sampling, ensures that the acceptance rate of Metropolis-Hastings is always one. We saw how to do that in Gibbs sampling by choosing the proposal distribution in a particular way. But if you look at this expression and think about what else you could do to make this number one, so that every step is accepted, you might notice another route: make these two probabilities, the probability of the point we step to and the probability of the point we come from, equal to each other, assuming a symmetric, reversible proposal distribution. Doing so would seem to require restricting the proposal distribution extremely, so that we only move along equipotential lines of the probability distribution, and that of course does not satisfy the requirements we saw in the proofs for Metropolis-Hastings on the previous slides. But there is an ingenious trick, based on the rules of probability. Remember that the axioms of probability include the sum rule, which tells us that if there are additional latent variables in our model, we can integrate or sum them out to get marginal distributions. The trick is to invert this process: instead, we invent additional latent quantities, ones that are not normally part of our model, such that we can ensure the Metropolis-Hastings acceptance probability is one, while still making sure that the marginal density over the quantities we actually care about is the one we are trying to sample from, and that our proposal distribution has full support over everything we want. The way to do this is associated with the name of William Rowan Hamilton, an Irish physicist who made great contributions to theoretical physics; among them, he was one of the first physicists to suggest the idea of conserved quantities and conservation laws to describe the laws of physics. He was not alone in that, his name just became associated with the idea, similar perhaps to Helmholtz, depending on your national allegiance. You do not need to know much about physics to understand how this works, but if you know a little high-school physics, there is a nice motivation arising from Hamilton's formalism that gives this algorithm its name. It is an algorithm designed for probability distributions of the following form: our p(x) can be written as a lowercase p, maybe with some normalization constant, times an exponential of some function, which we will call minus E, so p(x) = (1/Z) exp(-E(x)). Such probability distributions are called Boltzmann distributions, and later in the course I will talk in more depth about the class of distributions that can be written in this way. For the moment, you can convince yourself that this is a very generic class: the exponential maps only to numbers larger than zero, so if you have a distribution that is non-zero on its domain, and you can guess that that covers a lot of distributions, then there is always, in some sense, a function E that allows us to write the distribution like this. There are corner cases, and we will have to talk about those later in the course, but for the time being we just need to note that this is a very generic class of distributions you might care about. And of course it is very easy to get E: you just take the negative logarithm of p, up to normalization. The name E is suggestive; it is called the energy of the system, motivated by thermodynamics, but you do not have to know much about that. The idea is going to be that, if we are trying to draw from a distribution over the variable x, we will introduce a new variable.
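Before the new variable enters, the Boltzmann form itself is easy to make concrete. Here is a small sketch (an assumed example, not from the slides): for a standard Gaussian the energy E(x) = -log p_tilde(x) is simply a quadratic, and integrating exp(-E) numerically recovers the normalization constant sqrt(2 * pi).

```python
import math

# Any density that is strictly positive on its domain can be written as a
# Boltzmann distribution p(x) = (1/Z) * exp(-E(x)) with E(x) = -log p_tilde(x).
E = lambda x: 0.5 * x * x               # energy of N(0, 1), up to a constant
p_tilde = lambda x: math.exp(-E(x))     # unnormalized Boltzmann density

# Recover the normalization constant with a simple Riemann sum over [-8, 8].
dx = 0.001
Z = sum(p_tilde(-8.0 + k * dx) * dx for k in range(16001))
print(round(Z, 4), round(math.sqrt(2 * math.pi), 4))   # both close to 2.5066
```

Working with E instead of p is convenient precisely because the algorithm never needs Z: only differences (and later, gradients) of the energy appear.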
We will call this new variable p. It could in principle be any kind of latent variable that we use in a certain way, which I am going to describe, but the typical interpretation is that you should think of p as the time derivative of x; time derivative meaning, as I have defined down here, x-dot = dx/dt. Now we define a quantity called the Hamiltonian, as in physics, H(x, p) = E(x) + K(p). The important algebraic property is that it is a sum of two terms, where the first term is E and therefore depends only on x, and the second term is some other function that depends only on p, not on x. The physical interpretation, if you like, is that E is a potential energy and K is a kinetic energy: kinetic energy depends on velocity, and the time derivative of a location is the velocity (the physicists in the room will notice that a mass is missing here; let's just assume the mass is one), while the potential energy depends only on the location, not on the velocity. That is the high-school-level physics interpretation. We then define the kinetic energy, for example, to be the square of the momentum, K(p) = p^2 / 2. Actually, we do not need the kinetic energy to have exactly this form, nor, and this is related, p to be exactly the momentum. What we really need is two ingredients. First, the Hamiltonian must have this sum structure; why that is relevant we will see in a moment. Second, we must impose some structure on the dynamics of this system, on the time evolution that our Markov chain produces as a function of x and p, such that this quantity H is conserved in time. Why? Because this conservation law is going to ensure that the acceptance rate of the Markov chain is one, and the factorization property, the sum structure in here, is going to ensure that it is easy to compute the marginal distribution over x. What we are going to do is simulate the dynamics of a dynamical system that operates on x and p such that H is conserved in time, run essentially Monte Carlo on that, and consider the probability distribution given, up to normalization, by the Boltzmann distribution e to the minus this function H. Because of the factorization, the marginal distribution over x is easy to compute: e to a sum of terms is just the product of e to the constituent terms, so exp(-H) = exp(-E(x)) * exp(-K(p)), and since this term depends only on p and this one only on x, if you put an integral over p around this, the x-term just pops out of the integral and the p-term integrates to one; and if it does not integrate to one, it integrates to a constant, which we can absorb into the normalization constant. And because H will be conserved, we will find that Metropolis-Hastings has acceptance rate one; we will see that in a moment.

So what dynamics do we need to impose to ensure that H is a conserved quantity? They are given by these two coupled differential equations, called the Hamiltonian dynamics; they are, if you know the term, ordinary differential equations: the time derivative of x (time being the index along which our Markov chain samples, and the dot being common physics notation for the partial derivative with respect to t) is given by the partial derivative of the Hamiltonian with respect to the momentum parameter p, and the time derivative of p is minus the partial derivative of the Hamiltonian with respect to x. So x-dot = dH/dp and p-dot = -dH/dx. Why are these the dynamics we need? Because they conserve H. To see this we need a little multivariate calculus: the total derivative of the Hamiltonian H(x(t), p(t)) with respect to t is, by the chain rule, dH/dt = (dH/dx)(dx/dt) + (dH/dp)(dp/dt). That is standard multivariate calculus. Now we plug in the properties of dx/dt and dp/dt, which are x-dot and p-dot from our required dynamics, and you notice that these two terms are identical up to sign, because products commute: this term equals this one. So they cancel, the time derivative is zero, and this function does not change over time. Now consider our acceptance probability; here I have just copied the term over from the previous slides. We are going to draw from a Markov chain for a dynamical system in which the time evolution of the variables p and x is given by the Hamiltonian dynamics. So, and this is written slightly wrongly on the slide, I should correct it down here with a pen: if we now do Metropolis-Hastings where the proposal distribution q simulates the dynamics of this system, then our acceptance probability is going to be p of (and here I have to introduce the variables: x and little p for the momentum, overloading the notation with p for probability) x and p, divided by p of x' and p', times the proposal ratio q(x, p | x', p') over q(x', p' | x, p). If the dynamics of our dynamical system are what is called time-reversible, meaning the probability to move from one state to the next equals the probability to move in the reverse direction, and that is true for this dynamical system, then this proposal ratio cancels out, and we are left with the ratio of the target densities, which involves only the conserved quantity H.
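The conservation that this argument relies on can be checked numerically. Here is a sketch with a leapfrog integrator, a standard scheme for Hamiltonian dynamics; the quadratic toy Hamiltonian H(x, p) = x^2/2 + p^2/2 and the particular step size are my assumptions, not from the slides.

```python
# Leapfrog integration of Hamiltonian dynamics for H(x, p) = E(x) + p**2 / 2
# with the quadratic energy E(x) = x**2 / 2 (toy example). The point: H stays
# (almost exactly) constant along the trajectory, which is what pushes the
# Metropolis-Hastings acceptance probability towards one.
dE = lambda x: x                         # gradient of E(x) = x**2 / 2
H = lambda x, p: 0.5 * x * x + 0.5 * p * p

def leapfrog(x, p, step=0.1, n_steps=100):
    p -= 0.5 * step * dE(x)              # initial half step in momentum
    for _ in range(n_steps - 1):
        x += step * p                    # full step in position
        p -= step * dE(x)                # full step in momentum
    x += step * p
    p -= 0.5 * step * dE(x)              # final half step in momentum
    return x, p

x0, p0 = 1.0, 0.5
x1, p1 = leapfrog(x0, p0)
print(round(H(x0, p0), 4), round(H(x1, p1), 4))   # nearly equal: H is conserved
```

The position and momentum change substantially over the trajectory, yet H is conserved up to a small discretization error; an exact simulation would conserve it exactly.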
Because this p is e to the minus H up to normalization, and because H doesn't change in time, this factor is one, and this algorithm is always going to accept. So maybe I should conclude the mathematical discussion by saying how we're actually going to produce this proposal distribution q that draws from the dynamical system. How does this work? We need to solve the differential equation defined by these Hamiltonian dynamics, and there are mathematical tools that do that; they are called numerical simulation methods, ODE solvers. We could go into deeper detail and talk about how these methods actually work, that is an entire lecture course in itself, called numerical analysis, and we're not going to do that, but I want to show you just one slide to give an intuition. We need an algorithm that simulates the dynamics of a system defined in x and p, such that the time derivatives of these two variables are given by these two partial derivatives. To solve this, we essentially need to find a curve that satisfies the constraints described by the differential equation, and there are algorithms that do that. We're not going to talk about how they work in detail, but I just want to note that these are actually relatively simple methods that recursively evaluate this function f, which here is given by the Hamiltonian dynamics, and from these evaluations recursively construct an estimate of the solution curve. Actually, this process is quite related to what a machine learning method does: there is an unknown quantity, the solution of the differential equation, which is connected to a quantity you can observe, the value of this function f at various points, and you now have to actively decide where to evaluate f such that the estimate you construct from those evaluations, let's not call it inference, just estimation, is in some sense good, somehow close to the true solution.
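As a concrete illustration of this recursive evaluation, here is a minimal sketch, my own toy example rather than the lecture's slide, using the simplest such method, forward Euler, on the harmonic-oscillator Hamiltonian H(x, p) = (x² + p²)/2:

```python
import numpy as np

def euler_hamiltonian(x, p, dHdx, dHdp, dt, steps):
    """Forward-Euler simulation of Hamiltonian dynamics: at every step the
    right-hand side f, i.e. (dx/dt, dp/dt) = (dH/dp, -dH/dx), is evaluated
    once and used to extend the estimated solution curve."""
    for _ in range(steps):
        x, p = x + dt * dHdp(x, p), p - dt * dHdx(x, p)
    return x, p

# Harmonic oscillator H = (x^2 + p^2) / 2, so dH/dx = x and dH/dp = p.
# The exact solution from (x, p) = (1, 0) is x(t) = cos(t), p(t) = -sin(t).
x1, p1 = euler_hamiltonian(1.0, 0.0, lambda x, p: x, lambda x, p: p, 0.001, 1000)
```

With a step of 0.001 over 1000 steps this lands close to the exact (cos 1, −sin 1); shrinking the step improves the estimate, at the price of more evaluations of f.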
It turns out that there are methods that do this, and here is an implementation; this is actual Python code. This is the entire Monte Carlo method, this is Hamiltonian Monte Carlo. It's a method that starts with an initial sample and then, at each time step, augments the parameter space with a momentum that is drawn in such a way that the Markov chain is reversible, and then evaluates both the energy function, that's the p of x you want to draw from, and the gradient of the energy function, so that we can compute our momentum update. In here there is a little solver for differential equations that just keeps iterating over several steps, so this is a for loop here, to estimate the solution of the differential equation forward in time. You don't have to understand exactly what is happening here; what you have to understand is that this is very simple code. I'm not calling any advanced package here; it's just four lines of Python to do that. This is typically the takeaway I want you to get when I show you code: if you're using a software package, there is a danger of getting the feeling that there is some magic happening inside of it. In fact, what most software packages actually do is implement a relatively simple mathematical function, a relatively simple algorithm or process, like in this case. So even though you've heard that a differential equation needs to be solved, this is actually a very simple process: it's just a for loop that updates, as you can see, very elementary objects, and then essentially makes a proposal. If this ODE solver were a perfect method, if it could actually simulate the Hamiltonian dynamics exactly, then this Hamiltonian Monte Carlo method would always accept.
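The code on the slide isn't reproduced here, but the structure just described, momentum resampling, a small for-loop ODE solver, and an acceptance step, can be sketched as follows; this uses a leapfrog integrator and my own function names, so treat it as an illustration rather than the slide's exact code:

```python
import numpy as np

def hmc_step(x, energy, grad, dt=0.1, steps=20, rng=np.random):
    """One Hamiltonian Monte Carlo transition targeting p(x) ∝ exp(-energy(x))."""
    p = rng.standard_normal(x.shape)         # augment with a fresh momentum
    x_new, p_new = x.copy(), p.copy()
    for _ in range(steps):                   # tiny ODE solver: a leapfrog for loop
        p_new -= 0.5 * dt * grad(x_new)
        x_new += dt * p_new
        p_new -= 0.5 * dt * grad(x_new)
    # Metropolis-Hastings correction for the solver's discretization error
    dH = (energy(x_new) + 0.5 * p_new @ p_new) - (energy(x) + 0.5 * p @ p)
    return x_new if rng.uniform() < np.exp(-dH) else x

# Draw from a standard 2D Gaussian: energy ||x||^2 / 2, gradient x.
rng = np.random.default_rng(0)
x = np.zeros(2)
samples = []
for _ in range(2000):
    x = hmc_step(x, lambda z: 0.5 * z @ z, lambda z: z, rng=rng)
    samples.append(x)
samples = np.array(samples)
```

If the leapfrog loop were replaced by exact simulation, dH would be exactly zero and every proposal would be accepted, which is precisely the point made above.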
However, we are only estimating the solution of this ordinary differential equation with a numerical method, so this acceptance probability is not going to be exactly one; it's going to be ever so slightly off, and this step here is essentially the Metropolis-Hastings step, which guards against problems arising from that. Instead of showing you much more of this code, let me move forward and show you another piece of code, oops, from Chi Feng. Now we can use the Hamiltonian Monte Carlo preset, and I'm going to start here. What you are seeing, I'll just let it run, I'll move it slowly, are the individual steps of Hamiltonian Monte Carlo. In every single step, the algorithm is currently at some point, so that's our Markov chain starting somewhere, and this chain of black dots you see is the evolution of the dynamical system. That's a solution of this differential equation, or actually an algorithm trying to estimate the solution of this differential equation, which itself is an approximation to the true solution. Once it has run for a certain while, it evaluates the acceptance probability, which, as you can see, is essentially always one; it's not quite one, but it's so close to one that almost all, and so far all, of the steps of this method are accepted. So this is a beautiful numerical method to produce samples, and you can see that it's in some sense much smarter than the Gibbs sampling method we've seen before, and certainly much smarter than Metropolis-Hastings, because it ensures that the system explores along the shape of the distribution we're trying to draw from, by simulating essentially a massive particle that is rolling around in this potential, and because it has constant energy it will eventually reach every single point. Now I can maybe speed this up a little bit, so you can see that it's moving around efficiently, and through these dynamics it's exploring the entire shape of the
distribution, adapting to the geometry of this distribution. This is actually very close to the state of the art. There are only two annoying aspects of this algorithm that you might still want to fix before you lean back, exhausted, and think, okay, that's as good as it's going to get. Those two problems are that this algorithm has free parameters, essentially the parameters of the differential-equation-solving part. One of them is an actual modeling quantity: we have to decide for how long to solve the differential equation, so we need to say what one time step actually is, how far the simulator is supposed to run. The other question is what the quality of the ODE solver should be, that's the step length of the solver. And you can see that this is a problem. If I let this algorithm run for longer, then, well, right now the algorithm is exploring really well across the domain, so this is maybe a good choice for this particular problem. But this method is clearly doing a little bit too much, so okay, let's take the step size a little bit shorter, and you can see what the algorithm does: it often starts simulating, and because I've now chosen the length of the simulation inefficiently relative to the overall geometry of the problem, the simulation typically turns around at some point and returns much closer to where it started than the maximum distance it has traveled. In fact, there are choices of these parameters that are even worse, that often take us basically back to where we just started. In the current literature, this behavior is called a U-turn of the sampler, because that's basically what it is; you can see it move back and forth. To fix this kind of issue, we essentially need to find smart ways to choose the parameters of this algorithm: the step
size and the time that we're going to simulate for. This is in fact a very general problem with any numerical method, something you should know if you don't yet have experience with numerical methods: this always happens with any kind of computational tool. There are always going to be some hyperparameters that you somehow have to set, and you don't want to set them manually; like in this setting, you want a smart algorithm that does that for you. And you will invariably find many papers by smart people who have thought about how to do this and have come up with beautiful solutions. One of them, for this particular setting, is called the no-U-turn sampler, abbreviated NUTS. It's an algorithm that is now almost six years old, so it's maybe not totally state of the art, but it's quite close, and it was published by Matt Hoffman and Andrew Gelman in New York. It works like this: here, it's actually in here, let's use this one, is a smart implementation of this method. The very basic idea is that it runs the Markov chain in two directions: it creates a momentum, then also inverts the momentum and runs the simulation in both directions, and it tracks statistics of how far these two chains travel before their mutual distance starts decreasing again, before they basically turn around and start moving back towards each other. Then it takes one of those two sides and decides to use that simulation. Of course, all of this has to be done with care so that detailed balance and accuracy are still satisfied, and if you want to know why that's the case, you can read the original paper. I don't want you to get that deep into the algorithm development; I just want you to understand that this is the level of care that is required to build good probabilistic inference algorithms. So I'm just going to run this for maybe a few more steps.
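The stopping condition at the heart of this idea can be sketched in a few lines; this is only the U-turn criterion, with hypothetical names of my own, not the full doubling-and-sampling machinery of the paper:

```python
import numpy as np

def u_turn(x_minus, x_plus, p_minus, p_plus):
    """NUTS-style check: the trajectory, with endpoints x_minus/x_plus and
    momenta p_minus/p_plus, has begun to double back once the momentum at
    either end points against the displacement between the two ends."""
    dx = x_plus - x_minus
    return bool(dx @ p_minus < 0 or dx @ p_plus < 0)

# Endpoints still moving apart: no U-turn yet.
still_apart = u_turn(np.zeros(2), np.ones(2), np.ones(2), np.ones(2))
# One endpoint's momentum reversed against the displacement: U-turn detected.
turned = u_turn(np.zeros(2), np.ones(2), np.ones(2), -np.ones(2))
```

In the actual algorithm this check is applied to the endpoints of a trajectory that is repeatedly doubled in a randomly chosen direction, so that detailed balance is preserved.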
Now you can see this algorithm run efficiently, exploring and moving around the input space; this kind of method is one of the state-of-the-art algorithms. Before I go to the final slide, I should tell you that of course you do not really have to implement this kind of algorithm yourself if you want to use it in practice. Like other parts of machine learning, in the past few years probabilistic inference has become a domain of computer science in which we see more and more standardized software packages, and there are software packages available for you that allow you to use these kinds of methods to draw from probability distributions. I don't want to advertise any of them, I just want to show you a few, well, that basically amounts to advertising them, I guess. One package that you might have heard about before is called PyMC. This is an open-source Python package that implements various Monte Carlo methods, and of course it also includes methods like Hamiltonian Monte Carlo and its variants that adapt step sizes. This is a specific tool just for sampling, but these Monte Carlo methods are also a key part of the notion of probabilistic programming that we're going to talk about more over the course of this course, and there are entire packages that enable this kind of notion of efficiently defining a probabilistic model and then running inference on it. One which is quite popular with practitioners is called Stan; in fact, the no-U-turn sampler was probably invented to make Stan work, and one way to describe Stan is that it's probabilistic inference with a no-U-turn sampler. If you want to have a look, just go to mc-stan.org. Another, more recent package, which I also don't want to promote, but mention just so you realize that there are several such packages, is called Pyro. It's developed as an open-source package mostly by Uber, and Pyro is a general package for probabilistic programming that also
includes a module for Monte Carlo methods. Here is the docs page for it: this is a generic way to implement a Monte Carlo method, where you have to define a kernel, that is, you have to decide which Monte Carlo method to use, and there are a few to choose from, mostly HMC, which is short for Hamiltonian Monte Carlo, and, where is it, whoop, NUTS, which is this no-U-turn sampler, which as I said is a variant of Hamiltonian Monte Carlo. I wanted to show you these software packages because there is a perception, in some parts somewhat further away from the core of the community, that deep learning is the only domain that has these beautiful software tools like TensorFlow and PyTorch and so on. That's not true; there are many cool software packages for probabilistic inference as well. These are just a few that I wanted to highlight, there are many more, and in fact this is a very lively field of scientific development; maybe by next year, when I might do this course again, I might have to update these slides, because by then there might already be a new cool software package that automates probabilistic inference. So, what we've done today: we've seen that Monte Carlo methods are a generic way of solving integrals. In the naïve fashion they converge, with one over the square root of the number of samples, to an unbiased estimate. This is very good, but it is hard to implement in practice, because actually drawing these samples efficiently is difficult. Therefore there are approximate methods, called Markov chain Monte Carlo methods, which consist only of local computations. These break unbiasedness, but produce a sequence of samples such that, if you let the chain run for a very long time and then scramble the samples, you hope to get samples that are effectively i.i.d. and asymptotically drawn from the correct target distribution. We haven't talked about how to diagnose whether that's the case or not; that will come at a later point as well.
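The one-over-square-root-of-N rate from this recap is easy to observe empirically; a quick sketch, my own toy example, estimating the expectation E[x²] = 1 under a standard normal:

```python
import numpy as np

rng = np.random.default_rng(1)
for n in [100, 10_000, 1_000_000]:
    # Monte Carlo estimator: average f(x) = x^2 over n draws from N(0, 1).
    # The expected error shrinks like 1 / sqrt(n), so every extra digit of
    # accuracy costs a hundred times more samples.
    estimate = np.mean(rng.standard_normal(n) ** 2)
    print(n, estimate)
```
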
One thing that is maybe important to understand is that building these Markov chain Monte Carlo methods is itself actually a task you have to understand well; it is not entirely automated, but there are processes that help you automate this design. In particular, there are certain families of Monte Carlo methods that are generically applicable. Among them is Gibbs sampling, well, Gibbs sampling is reasonably generic; it is particularly efficient for probability distributions which have interesting conditional independence structure, so that you can design these axis-aligned proposal distributions efficiently. And then there are methods like Hamiltonian Monte Carlo, which work on generic smooth distributions by simulating dynamics on an expanded variable set that includes the variables you want to sample from plus additional variables, which the algorithm basically keeps track of and then marginalizes out for you. These algorithms are currently among the state of the art; they require careful parameter tuning, but there are beautiful algorithms to do that. And of course, I have shown you Hamiltonian Monte Carlo, but this is not the only Markov chain Monte Carlo method that is state of the art; there are other ones out there, and the notion of sequential Monte Carlo is also very important, although we don't have time to talk about that today. What I wanted you to get out of this: if you need to do probabilistic inference, you need to solve integrals. To do so, you can use Monte Carlo methods, and you need to use good Monte Carlo methods, because otherwise your Monte Carlo method might not work well at all. These good methods require a little bit of understanding of what is going on, and they are available in software packages, so you don't have to implement them yourself; but if you know what they do, this can help you build really efficient implementations that need way, way fewer samples.
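For contrast with the Hamiltonian machinery above, the axis-aligned conditional updates that Gibbs sampling relies on can be sketched for a correlated bivariate Gaussian; this is a standard textbook example of my own, not code from the lecture:

```python
import numpy as np

def gibbs_gaussian(rho, n_samples, rng):
    """Gibbs sampling for a zero-mean bivariate Gaussian with correlation rho.
    Each half-step resamples one coordinate from its exact axis-aligned
    conditional: x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2."""
    x1, x2 = 0.0, 0.0
    sd = np.sqrt(1.0 - rho ** 2)
    out = np.empty((n_samples, 2))
    for i in range(n_samples):
        x1 = rho * x2 + sd * rng.standard_normal()
        x2 = rho * x1 + sd * rng.standard_normal()
        out[i] = x1, x2
    return out

samples = gibbs_gaussian(0.5, 5000, np.random.default_rng(2))
# The empirical correlation of the chain should settle near rho = 0.5.
```

The efficiency of these axis-aligned moves degrades as rho approaches one, which is exactly the regime where the geometry-following behaviour of Hamiltonian Monte Carlo pays off.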
This is important because Monte Carlo methods, even if they are exact, so not approximate like Markov chain Monte Carlo methods, at the very best converge like one over the square root of the number of samples, stochastically if you like, and to make Markov chain Monte Carlo get close to this already computationally demanding performance, you need to know a little bit about how to design the Markov chain well and how to structure it and set its hyperparameters. The takeaway from this for you, maybe, is that actually doing probabilistic inference can be quite challenging if you want to do it generically. Maybe that's worrisome, but then, if you think about how hard it is to get a deep learning model to work, maybe it's okay if this beautiful framework, which allows reasoning with uncertainty and encoding structured prior information, requires a little bit of hard work as well. The reason why it is so hard is that we are trying to solve an integral, and that is fundamentally a hard task which, after centuries of research, and quite in contrast to differentiation, its opposite, is still an unsolved problem. With that, I want to briefly remind you that there is a new exercise to look at today; it is actually the second week of the project you started in the past week, to build a Battleships agent, an agent that can play this tabletop game. This week we are going to move from a single ship on the board to multiple ships on the board, and actually try to build an agent that plays this game against a human or against another agent. To do so, you will have to think carefully about how to address the combinatorial explosion of complexity that arises when you have multiple ships on the board, and how to implement the resulting inference efficiently; maybe you will get to use some of the Monte Carlo methods we have spoken about today when trying to solve this exercise. I hope you have fun with it, and thank you very much for your time.