Welcome everyone to lecture number nine. For the last eight, or actually seven, lectures you've gotten to enjoy well-prepared, beautiful lectures by the PhD students; for the next two lectures I'm going to take over, and I'll be less prepared, I can tell you. We've been talking about linear algebra first, which makes total sense, because linear algebra is the foundation for everything: you can't really do computation if you don't know how to solve linear systems of equations. And then, because it was natural to extend from the point of view of Gaussian process regression, we moved to simulation, and it turned out that simulation is actually quite a complicated problem.

For the next two lectures, well, first of all I can tell you that you've now basically passed the hard stuff. If you've made it this far, the rest of the course is going to be fine. What we now move to for the next two lectures is, ostensibly, integration. The reason we're doing it at this point is that simulation followed naturally from Gaussian process regression; integration will have a slightly different character. We'll use it less as a direct means to get to tools you absolutely need to know, and more to reflect a bit on the nature of computation as such, and how to do it right.

When I say integration, I mean the kind of computations we need to perform in machine learning when we do Bayesian inference. To do Bayesian inference you have to compute conditional distributions, posteriors, using Bayes' theorem, and Bayes' theorem contains an integral: it contains a joint distribution, and that integral is essentially a kind of marginal, so you can think of this marginal also as an expected value of some conditional distribution. And then, of course, expected values more generally are quantities we want to know. Once you've done Bayesian inference, you want to know afterwards: what is the expected value of some quantity? What is the mean? What is the variance? What are covariances? How do things depend on each other? To answer these kinds of questions for general distributions, we always have to compute quantities that look a little bit like this. There is some function which we care about: if you want to compute the mean, that function is just a linear function; if you want to compute a variance, it's essentially a quadratic function. And then there is some underlying distribution p, a probability density, or a probability measure in general, that we need to integrate over.

So how do we do this? Today we're going to focus on one very particular way of doing it, called Monte Carlo. Quick check: hands up, who has used Monte Carlo before? Yeah, everyone. Good. Not everyone, actually; there are like three people who didn't raise their hand, or just haven't woken up yet. Okay, so you all know about Monte Carlo. That's good, because we can do a really quick run-through. One thing we're probably going to end up doing anyway is a quick recap of how Monte Carlo works, from the very simple idea to the most advanced algorithms out there, just so that we're all on the same page about what kind of algorithms we're working with. But what I also want to do along the way is to reflect a bit on the history of how these algorithms emerged and how they are used, because the main message for today, spoiler alert, is going to be that there isn't really such a thing as just "the integral", and one type of algorithm to solve them all. Whenever you're facing an integral, it makes sense to think about why you're computing it; and when you're using a certain algorithm to do so, it makes sense to think about why you're using that one.
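To fix notation before we get to the history: here is the generic quantity this lecture is about, written out. The symbols φ for the function and p for the density are my choices for transcribing what is on the slides:

```latex
% the generic quantity of this lecture: an expectation under p
\bar\phi \;=\; \mathbb{E}_p[\phi]
         \;=\; \int \phi(x)\, p(x)\, \mathrm{d}x
% mean:     \phi(x) = x
% variance: \phi(x) = \big(x - \mathbb{E}_p[x]\big)^2
```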
So here's the history of Monte Carlo. It starts in the Second World War, with actually several different people; it's a bit tricky to make out who actually came up with it, but two really good contenders are Stanislaw Ulam, whose name I, with many Polish people in the room, have probably mispronounced already, and John von Neumann, whom everyone who has ever studied computer science knows: world famous for various things, including the von Neumann architecture, and working in the US on the Manhattan project. Some physicists had come up with the beautiful observation that the nuclei of atoms can be split, creating neutrons, and there was an idea in the air that it might be possible to build a bomb from this kind of reaction. How do you do that?

The story goes, if I remember correctly, that Ulam was in the hospital and von Neumann visited him, and they chatted about, you know, building bombs, as you do. The problem they had was this. The reaction is: there's an atom, it gets hit, or actually it absorbs a neutron, and then it splits and produces new neutrons, and those neutrons are fast and spread out. What you'd like is to use those neutrons to split more atoms; that's a chain reaction. But if you just put a bunch of fissile material together in a ball, that doesn't actually happen. So the question was: what's the right geometry? How do you arrange this material so that you actually get an explosion? And that's apparently surprisingly hard. It's not just, well, okay, if you make a ball that's big enough then yes, but then you already have the bomb exploding in your lab; you need it to explode only once it's over Japan, in this case. They had basically no computers to speak of, so they had to think about geometries, and that's hard.

So they came up with this idea. Notice that this is a kind of integration problem: there's a bunch of neutrons flying around, you can optimize the geometry, and you'd like the expected density of new neutrons to be maximized in some region. So you need to compute this statistic, this expected density of neutrons. How did they do that? They came up with the idea that we can approximate such integrals with sums over evaluations of the function we want to integrate, if the points where we evaluate the function are drawn from this distribution. We'll see in a moment why this actually works, but first of all, how did they then proceed? Well, they built a little computer, an analog computer. It looked like this; it's called the FERMIAC. And it worked. You can see that it consists of three things, basically: two wheels and a pen, and the pen has lines attached to it so you can measure directions. Those two wheels correspond, if I remember correctly, to two different settings: there are two different types of neutrons to simulate, thermal ones and fast ones, and that's what these two wheels do. You can see there's a big wheel and a thin wheel, and you can switch between these two modes.
It's like two modes of computation. And then this thing, which is difficult to see in this picture from this direction, has several different rings of decreasing thickness, which correspond to the different types of material you can simulate. So the way this works is: you take this thing, sorry for the bad picture, it's the best one you can find online, and of course you have to wear your black gloves, because it's 1948, and you hold it over a drawing of the geometry you're trying to test, with a bunch of dice to create random numbers. Then you simulate the path of a neutron. You throw a die to get, at the tip, a sample from the distribution of initial energies of these neutrons; that tells you how far they're going to go, depending on the material they're in. You move this machine across the surface until the wheels tell you that you've used up all the energy, and when you cross between parts made of different material, you change the mode to keep counting up how much energy you've already lost. Then you throw a die again to randomly draw what kind of event happens next, whether you get a fission or some other type of event; it's been a while since my nuclear physics classes. And then you simulate like a hundred of these paths, so that afterwards the picture is completely covered in black lines, and that gives you a distribution, an empirical one, for where the neutrons are. You can look at it and say: maybe I need more of a bend over here, or something, and then you do the whole thing again. And if you do this often enough, you manage to build a nuclear bomb, which was, well, yeah. This one is maybe a pretty big one, which looks more like a hydrogen bomb; we're getting to hydrogen bombs in a moment, because their builders are the inventors of the other, smarter kind of computation.

The reason I'm telling you this story, apart from the fact that it's a nice story, is: notice how this is actually a simulation method. We're not initially trying to solve some integral; we're actually trying to simulate some process, by creating an algorithm that produces a simulation of a stochastic process, but in a very pedestrian, literally hand-walked way, because there were no computers to use yet.

So why is this a good way to compute? The idea, to say it again, is that we're interested in this quantity, which is an expected value: the average number of neutrons, or the average location of the neutrons, or the average length of a path, and so on. And we're going to replace this integral by this thing, which is a sum instead of an integral, a finite object instead of an infinite object, over evaluations of this function at points drawn from this distribution p. This drawing is what happens with the dice in the case of the FERMIAC. So why is this a good idea? This is the part you've probably all seen in previous lectures: because this is an unbiased estimator. What that means is the following. This thing up here is just a number; it's not a random number, it's a deterministic quantity, an integral, the kind of thing you learn about in high school.
This thing down here, on the other hand, is a random variable; that's the kind of thing you learn about in university. But this random variable has a beautiful property, or several beautiful properties. One of them is that it's unbiased: the expected value of this quantity is equal to the quantity we care about. To see this, we just compute its expected value. The x_s are drawn i.i.d. from p, so the expected value of the whole thing is, because they are i.i.d., the sum over the expected values of the individual terms. Each of these terms is just φ̄, so we have a sum of S copies of φ̄, divided by S, which is just φ̄.

Now let me quickly check: is this too fast or too slow for you? Too obvious, you've seen it before? Okay, there's good nodding, so we're too slow; I'll speed up a little. So we now know that this random thing we're computing has, on average, the correct value. But every single instance you compute has the wrong value, right? Because it's a random variable. So the next question is: how far from the truth is it on average, given that on average it's correct? To answer that, we compute the variance of this random variable. Have you seen this computation before? Do you see where I'm getting to? Very good; not everyone, but most people. So I'm going to save you the complicated business of expanding the two sums; we'll just see that the variance of this estimator, the expected squared distance between the random variable we use to estimate the integral and the actual integral, is a constant, given by the variance of the function φ under p, divided by the number of samples S. That means that as S increases, the error gets small, and that seems to be a good thing: we have a random number that on average is correct, and if we expend more computation, it gets ever better. Okay, so that is actually the entire argument for why these algorithms are a good idea.
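Written out, the two computations I just skipped over look like this (assuming, as above, that the x_s are drawn i.i.d. from p):

```latex
% the simple Monte Carlo estimator
\hat\phi_S = \frac{1}{S}\sum_{s=1}^{S} \phi(x_s), \qquad x_s \sim p \ \text{i.i.d.}
% unbiasedness:
\mathbb{E}\big[\hat\phi_S\big]
  = \frac{1}{S}\sum_{s=1}^{S} \mathbb{E}_p[\phi(x_s)]
  = \frac{1}{S}\, S\, \bar\phi \;=\; \bar\phi
% variance (the cross terms vanish by independence):
\mathbb{E}\big[(\hat\phi_S - \bar\phi)^2\big] = \frac{\operatorname{var}_p[\phi]}{S},
\qquad \operatorname{var}_p[\phi] = \mathbb{E}_p[\phi^2] - \bar\phi^2
```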
Now there are two caveats. The first one: is this actually a good rate of convergence? Note that the bound is on the expected squared distance, so the expected distance drops like one over the square root of S, the square root of this expression. So, question: is this good or bad? It depends on whom you ask. If you ask a statistician, they'll say it doesn't get any better, because no unbiased estimator can possibly converge faster than this; it is the minimax optimal rate for unbiased estimators, and there's no way to get a better estimate. But if you ask a numerical mathematician, they'll say, well, this is about the worst possible rate you could imagine; it's not even linear, if you're looking for polynomial rates. With the simulation methods that Jonathan and Nathanael and Marvin told you about, we spoke about convergence rates on the order of the amount of computation invested raised to some positive power: not to one half, but to four, five, six, whatever. That seems like a weird discrepancy, which we'll have to investigate. Next week we'll talk about methods that have these good convergence rates, but they'll have their downsides; today we'll talk about this thing, and why it might still be a good thing to do.

The other observation that is usually made at this point, and I used to make it all the time as well, is that the nice thing about this rate is that the numerator is just a constant, and notice how there is no d in it, where d is the dimensionality of x. So x could be a big vector, could have millions of dimensions, and there is no reference to the dimensionality of the problem in this expression. Is this true or not? Here people disagree again. There's actually a thread that went around social media recently, and it got pointed out to me because it cites me at some point, so I had to look at it again. It comes from this blog; I recommend you have a look at it and read it. You can click on the link in the slides, this underline is actually a link. And I have to say, I think there is a bug in the argument in that blog post, a harmless one that doesn't actually change the argument, but still a bug; if you can find it, send the author a comment, with best regards from the lecture.

His argument is: it looks like there is no dimensionality in this problem, but of course there is this variance of f under p, and we can think of situations where that variance has something to do with dimensionality. That's a bit weird at first, right? We're thinking of this f as a given function; it doesn't depend on dimensionality as such, it's just a function you've been given. But of course there are settings where you can vary the dimensionality of the problem, and if you stick within some structured family of problems where the dimensionality can vary, you can construct situations where this variance explodes really badly as a function of dimension. That's his argument. I actually started writing down the full derivation, and then realized there's probably a bug in it, so I think his derivation doesn't quite work, but it's essentially true: if you take care to construct f in such a way that it behaves interestingly with dimensionality, you can make this variance explode really badly, including exponentially fast. However, typically we're thinking of f as the mean, so a linear function of x, or a quadratic function, and then this usually isn't actually a problem. Long story short: I have to be really careful about this, because people cite me claiming that I'm not making this point, so I'm going to make it very clearly. Yes, the convergence rate of Monte Carlo is not independent of the dimensionality of the problem; but for most interesting applications of this way of solving integrals, it is very insensitive to the dimensionality, in the sense that you can use Monte Carlo sampling to solve very high-dimensional problems of computing expectations.
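Just to make the "you can make it explode" direction concrete, here is a quick constructed example of my own (not the one from the blog post):

```latex
% take p = N(0, I_d) and f(x) = exp(sum_i x_i); then s = sum_i x_i ~ N(0, d), so
\mathbb{E}_p[f] = e^{d/2}, \qquad \mathbb{E}_p[f^2] = e^{2d}
% and the relative variance grows exponentially with the dimension d:
\frac{\operatorname{var}_p[f]}{\mathbb{E}_p[f]^2} = e^{d} - 1
```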
And the methods we're going to talk about next week will patently not have this property; they will not scale well to high-dimensional problems, in a similar way to how... no, actually, I'll not make that point, because it's probably technically slightly incorrect. So: Monte Carlo scales with dimension, but it converges in a questionable way with the number of samples we draw, with the amount of compute we invest in the computation.

Here's the classic example, again just to make sure we're all talking about the same thing. Let's say we want to estimate π, which is a stupid thing to do, because there are fast ways to compute π, but one way to do it is to draw random numbers in the unit square. There's this line of Python; here's a definition of π. You just draw uniform random numbers, square them element-wise, and sum them in pairs. Whenever that sum is less than one, you're inside the circle; whenever it's larger than one, you're outside the circle. The area of this quarter circle is π/4, because if this is a unit box, then this is a unit circle, which has area πr² with r = 1, so the area is π; we have a quarter of that, so it's π/4. So if you multiply the average value of this binary random variable, inside or outside, by 4, you get an estimate for π.

If you plot the absolute error of this admittedly stupid way of computing π as a function of how many samples we draw, you get the plot on the right. You can see it converge towards the truth as a function of the number of evaluations, and it converges with a rate given by one over the square root of the number of samples; those are the straight angled lines I've plotted in. I've also plotted the expected error, which involves the variance of these Bernoulli variables. By the way, how do you compute that variance, just as a sanity check? How do you think I got to this solid black line? For that I need the variance of this function f, divided by n; the division by n is easy, that's how the plot comes about, but what's the constant? What we're computing here is a Bernoulli random variable: it's either one or zero, but it's one with a certain probability, and that probability is π/4. And what's the variance of a Bernoulli random variable that is one with probability p? Yes: p times (1 − p). So it's π/4 times (1 − π/4), and that's exactly how I got to this black line.

Okay. So this means that with a very small bit of code, this one line of Python, and a tiny amount of compute, nine or ten samples, I get a number that is correct to one order of magnitude. Around ten samples, we get something that crosses the 10⁰ line: we start getting roughly a three on average; before that, we basically get zeros and fours, because each individual sample contributes either 0 or 4, and then it suddenly gets better.
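Here's a minimal version of that experiment in Python (a sketch; the variable names and the seed are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
S = 10_000

# points uniformly in the unit square; inside the quarter circle iff x^2 + y^2 < 1
x = rng.uniform(size=(S, 2))
inside = (x**2).sum(axis=1) < 1.0   # Bernoulli variable with p = pi / 4
pi_hat = 4 * inside.mean()          # unbiased Monte Carlo estimate of pi

# the solid black line from the slide: expected error scale 4 * sqrt(p(1-p)/S)
p = np.pi / 4
print(pi_hat, 4 * np.sqrt(p * (1 - p) / S))
```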
That's kind of good. But if you want a really high-precision answer, like, you know, the first 100 correct digits of π, well, you can extrapolate this line down to 10⁻¹⁰⁰, and that takes a while, probably longer than the remaining lifetime of the universe. And as simple as this example is, it actually contains everything you need to know about Monte Carlo. If you're faced with a problem like these nuclear physicists were, where you have just no other way of computing something, the thing you're working with is really unwieldy, and you just want a rough estimate, then this is a very smart thing to do, because it's often quite easy to construct a simulation, to draw from the distribution, and then just watch the simulations and see what they typically look like; basically, looking at samples of, you know, probabilistic machine learning models. But it's very hard to use this kind of computation to get a precise answer. So if you're facing an integration problem in some of your machine learning work, and you want an answer by this afternoon without having to work much on the math, this is a smart idea; but if you want a precise answer, this is not the thing to do. That may seem completely obvious, especially if you've had a probabilistic machine learning class, but I think this insight isn't taken sufficiently seriously by most practitioners. There are still very many people out there who think Bayesian inference is always Monte Carlo, and that it's always very expensive. I think that fundamentally comes down to this insight: they're thinking, well, to do Bayesian inference I need to do Monte Carlo, because that's the only way to compute integrals, and that's very expensive because you have to run it for a long time. And that's not true, because there are other ways to approximate integrals, which can also work really well; they're typically more work to get working, but then they can potentially be very fast.

So that was the easy bit. That's what's called simple Monte Carlo: you can actually draw from the distribution, and then you just simulate a lot and compute expected values under the simulations. In practice, of course, quite often you can't actually draw from p, and so we need ways to construct samples from a distribution that is fundamentally intractable. Once you have a tractable distribution, Monte Carlo turns the intractable integral into a tractable sum; but for that you first need a tractable distribution from which you can tractably draw samples. And the classic way to do that is, well, before I go to the next slide: okay, good, I'm slowly catching up with your knowledge.
So there are standard ways to take uniform random numbers and turn them into samples from some other distribution, which work well; well, okay, up until about two years ago I would have said which work well for very simple, low-dimensional problems. These are, essentially, direct transformations: sometimes, for certain one-dimensional distributions, you can fiddle with pen and paper and find ways to turn uniform random numbers into Gaussian random numbers, into gamma-distributed random numbers, into beta-distributed random numbers, and so on. That's very basic; it's what happens inside your random number generator subroutine. numpy.random is full of these methods. Okay, fine. Then there are methods like rejection sampling and importance sampling, which take standard, typically Gaussian-distributed random variables and use them to approximate other distributions; those don't scale well to high-dimensional problems, only to two or three dimensions. And a quick injection at this point: what I just described is what the story used to be, and it's probably the story you got to hear in your probabilistic machine learning class as well. But there are now methods that can actually draw from extremely high-dimensional, very complicated distributions, essentially by transforming uniform random numbers, and they're called score-based diffusion models. This lecture is not about score-based diffusion models; they're very new, and you've all seen them, right? They drive things like Stable Diffusion. They're essentially just that: direct transformations of uniform random numbers through a really complicated nonlinear function, a neural network, essentially. But we didn't quite have time to get to that story for this lecture. Maybe at the end of this lecture I'll ask you again why we're not yet using diffusion models to draw from Bayesian inference problems; you can think about that for the rest of the lecture, if you get bored by what I'm going to tell you next.
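Back to the basics, though: to make the "direct transformation" idea concrete, here's the classic Box-Muller transform, a pen-and-paper recipe that turns pairs of uniform random numbers into pairs of independent standard Gaussians (a sketch; numpy's own generators use similar but more refined methods):

```python
import numpy as np

def box_muller(u1, u2):
    """Turn two uniform(0,1) arrays into two independent N(0,1) arrays."""
    r = np.sqrt(-2.0 * np.log(u1))  # radius, from inverting the CDF of r^2 ~ Exp(1/2)
    theta = 2.0 * np.pi * u2        # uniformly distributed angle
    return r * np.cos(theta), r * np.sin(theta)

rng = np.random.default_rng(0)
z1, z2 = box_muller(rng.uniform(size=100_000), rng.uniform(size=100_000))
print(z1.mean(), z1.std())          # should be close to 0 and 1
```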
Okay. So the standard way to produce random variables that are distributed according to some high-dimensional distribution is called Markov chain Monte Carlo, and that's what we're going to talk about for the rest of this lecture. You've probably heard the phrase Markov chain several times; here's the picture again that was also in Jonathan's and Nathanael's lectures: methods that are based on a structure that steps iteratively forward with a finite memory. That's what a Markov chain is: there's some data structure where you do a local computation that depends only on some finite local state. Typically that finite local state is the sample we're trying to draw, but we'll get to a point where it's a bit more than that, just like in simulation we also got to a point where it was a bit more than that.

There's essentially really only one method of this type, which engenders all the other ones. It was invented by the generation of nuclear physicists who wanted to build hydrogen bombs, thermonuclear weapons, rather than classic nuclear fission weapons. In particular, the names connected with it are Edward Teller and the husband-and-wife team of Marshall and Arianna Rosenbluth, who worked on nuclear weapons in Los Alamos in the early 1950s, 1953, 1954 and so on. They had slightly more complicated geometries to optimize, and they had to use random number generators that were a bit more advanced. Together with Nicholas Metropolis, and later Hastings, who generalized the method, they ended up with the algorithm you've probably seen before: the Metropolis-Hastings algorithm.

To remind you, it works as follows. We're constructing a Markov chain, so there's a for-loop wrapped around what's on the slide; that's the important bit. Now, what happens inside? We are at a current state x_t, and we draw a new location x' from some proposal distribution called q; the typical choice to have in mind is that q is a Gaussian centered at the current point. Then, at this proposed location, we compute what initially looks like a weird ratio: the ratio between the distribution we're trying to draw from, at the proposed point and at the current one, multiplied by the ratio of the proposal distribution with reversed arguments. That's the general form. Typically q is a Gaussian, so x_t is the mean of the Gaussian and x' is its argument, and because Gaussians are symmetric around their mean, there's an (x' − x_t) squared in there whose sign you can flip, so those two proposal terms are the same; their ratio is just one and cancels out, and we only care about the ratio between p at x' and p at x_t. If that number is larger than one, we go to that point; that means if p(x') is higher than where we currently are, we always go there. And if it's lower, we go there with a probability that depends on the ratio between the two. So if the proposed step lands somewhere of extremely low density, it's extremely unlikely to be accepted, and if the density is essentially the same, the probability of accepting is very high. One intuition is that it's a little bit like optimization, but with a probability of going downhill, where the probability of going downhill depends on how far downhill you're going.
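As a minimal sketch in code, with a symmetric Gaussian proposal so that the q-ratio drops out (the function names and the step size are my own choices):

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_steps, step=0.5, seed=0):
    """Random-walk Metropolis: symmetric proposal q(x'|x) = N(x'; x, step^2 I)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_p(x)
    samples = np.empty((n_steps, x.size))
    for t in range(n_steps):
        x_prime = x + step * rng.standard_normal(x.size)  # propose
        lp_prime = log_p(x_prime)
        # accept with probability min(1, p(x')/p(x)); done in log space for stability
        if np.log(rng.uniform()) < lp_prime - lp:
            x, lp = x_prime, lp_prime
        samples[t] = x  # a rejection repeats the current point in the chain
    return samples

# usage: draw from a 2d standard Gaussian via its (unnormalized) log-density
chain = metropolis_hastings(lambda x: -0.5 * x @ x, np.zeros(2), n_steps=5000)
```

Note that the normalization constant of p never appears: only ratios of p are needed, which is exactly why this works for unnormalized posteriors.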
But actually, the best way to get a sense for how this works is to use a beautiful animation, which I'm just going to start and move over to; I should have started it before, sorry. So let me go here and move this up. This doesn't come from me; it comes from a guy called Chi Feng. Here is a probability distribution; we're going to do Metropolis-Hastings, starting at some point, initially here. Now let me see if I can make this a bit smaller. So, initially we're at some point; we draw a point from the proposal distribution, in here, and then check what the ratio is, and that's probably not going to be accepted. Then we draw a new point, which also doesn't get accepted, and a new point, oh, and that one is going to get accepted, because the probability density at that point is higher than here, so we accept with probability one and go there. And then we keep doing this many, many times, and if the algorithm runs for a long time, it produces a bunch of points; I'm not calling them samples yet, because we'll have to talk in a moment about why those are acceptable samples.

And maybe, while we're staring at this plot, because it's kind of nice to look at: by the way, have you seen this animation before? No? Yes? No? Okay, half of you have. One thing to note is that sometimes it's really good to look at visualizations of algorithms like this, because you can immediately see inefficiencies. One thing: whenever this thing stays at some point, you can actually see the histograms move while it stays there. This is part of Metropolis-Hastings: it doesn't just discard a rejected point, it stays where it is but keeps adding the current point to the count. And then we can notice that how well this thing explores depends on how well the proposal distribution is adapted to the entire distribution. For example, if you make it a bit smaller, then it's going to move around more, because proposals typically land in a region close to the current point, so if you're already in a characteristic region of the distribution, you're highly likely to accept. But if you make it too thin, then we accept every point, but we're basically stuck in some region; we don't explore. So this is essentially an exploration-exploitation kind of problem: we need to find a setting where the algorithm moves as much as possible while also accepting as much as possible.

Okay, let me stop this and go back. So why is this an acceptable algorithm? Why does it actually work? This algorithm has two interesting properties, which are the properties you always find in papers on newly invented Markov chain Monte Carlo algorithms; they are called detailed balance and ergodicity. Who has heard of detailed balance before? Now we're getting somewhere. Okay, so detailed balance is the property that the probability of being at some point and moving to another point, if p is actually the distribution we're trying to sample from, is equal to the probability of being at the other point and moving back. So if p is the correct p, then the algorithm is as likely to be at some point and move to the other as it is to be at the other point and move in the reverse direction. And showing this is actually quite straightforward for Metropolis-Hastings; it's much harder for more advanced algorithms. We literally just plug in the definition: the probability of moving from x to x' is the probability of proposing that point times the probability of accepting it; we accept with the minimum of one and this ratio, then we move these terms into the minimum, rearrange, move them out of the minimum again, and we find the same expression on the left- and right-hand sides of the equation.

The beautiful thing about this property is that it implies that this entire process, run for a long time, has a stationary distribution, which is given by p. What this means in math is: look at this expression down here, and think of it as an operator acting on the entire function p. This T(x' | x), the transition operator, the probability of going to x' if you are at x, is a conditional probability distribution, but you can also think of it as an operator that gets applied, under the integral, to p(x).
So what happens in this integral? By the line we just showed, we can replace this expression with this expression; that's literally just detailed balance from above. Then we notice that p(x') doesn't depend on x, so we can take it outside the integral, and what remains inside is just a probability distribution, the probability of going to any value x given that you are at x', so it integrates to one. We're left with p(x'). So if you have a set of random numbers distributed with density p, and this algorithm operates on this, in your head, infinitely large set of random numbers, then it doesn't change the density of those points. The density p is a stationary point of this operation. Algorithms with this property have at least one such stationary distribution.

Then the only other thing you have to check is that there isn't another stationary distribution that you might accidentally get stuck in. This is the slightly more tricky point, and it is formalized in the notion of ergodicity. This is always a bit difficult to explain, because it's a little bit self-referential; it essentially writes down the fact that there is insufficient structure in the algorithm to get stuck in some other stationary distribution. Ergodicity of a sequence means that it is aperiodic, it doesn't contain a recurring sub-sequence, and that it has positive recurrence: if you've once been at some point, there's non-zero probability of coming back to that point at some future time. Processes with these properties have at most one stationary distribution, and our algorithm has them. And here's the bit that is nearly pointless to prove, so I'll give you a hand-wavy answer. Why is it aperiodic? Well, because we're drawing randomly, so there's no periodicity; but that's really just spelling out what I mean by random. It is not periodic: we're not going to do the same thing after eight steps, because that would mean we could predict what happens after eight steps, right? And it has positive recurrence, which is actually correct for Metropolis-Hastings with proposal distributions that have full support, that could potentially move everywhere, because then there's always a finite probability of moving to a point you've previously been at. So this algorithm fulfills both properties: it has at least one, and at most one, stationary distribution. That means if you run it for a really long time, it has to end up in that one stationary distribution, and that is the distribution we're trying to sample from.
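Written out, the two steps of that argument are (in the notation from above, with T the transition operator):

```latex
% detailed balance:
T(x' \mid x)\, p(x) \;=\; T(x \mid x')\, p(x')
\quad \text{for all } x, x'
% stationarity of p then follows in one line:
\int T(x' \mid x)\, p(x)\, \mathrm{d}x
  \;=\; \int T(x \mid x')\, p(x')\, \mathrm{d}x
  \;=\; p(x') \underbrace{\int T(x \mid x')\, \mathrm{d}x}_{=\,1}
  \;=\; p(x')
```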
The only problem with this statement is that it's just an existence statement: eventually the chain gets to the stationary distribution, but we're not seeing how fast. And remember, for simple Monte Carlo we had this square-root convergence, which you already thought was a bit bad, but which we also realized is actually the best rate you can hope for. So this algorithm is going to have a worse rate. If you run it, you typically get behavior like this. This is an insight that comes from David MacKay's wonderful textbook on machine learning: such algorithms often have this kind of local random-walk behavior. You just saw it in the animation we looked at, and it means we have to think of a random process that moves on top of the random process we're already trying to draw from. The number of effective samples this algorithm draws has something to do with the free path length, the time between two samples at completely opposite ends of the distribution.

One way to think about this: say you have a distribution that looks sort of like this, very elongated; that's typical for non-trivial distributions in high dimensions. There will be one length scale that is the largest, in this case the diagonal from bottom left to top right; let's call it capital L. And there will be one length scale that is the smallest, the one orthogonal to it, which is tiny; let's call it ε. Now, if you make the proposal distribution much wider than this small width, as I just did in the animation, then the acceptance rate will be very bad: if I make the proposal really big, most points will not be accepted, so the chain sticks around in one place for a long time. And if I make the proposal distribution very small, then every point will be accepted, but we're hardly going to move at all. So probably we're going to set our step size to something like the small length scale ε, and then we get a random walk with steps on a scale of ε. Random walks travel a distance of about the step size times the square root of the number of steps, and the number of effective steps is the acceptance rate times the number of steps taken. So the average distance traveled in time t is the proposal width times the square root of the acceptance rate times t. If you now put the big L in here, because you want to travel from one end of the distribution to the other, and rearrange this equation for t, you find that the number of time steps you have to wait to get effectively one sample is the square of the ratio between the long and the short length scales. If you're thinking of the covariance matrix of a Gaussian, then L and ε correspond to the largest and smallest eigenvalues, so L/ε is like the condition number of that matrix, and we're paying the condition number squared. That's pretty bad, and it means you get convergence rates that are still O(1/√t), but with a huge multiple in front.
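As a worked equation (with a the acceptance rate; this is the back-of-the-envelope version, not a rigorous statement):

```latex
% distance covered by a random walk with step size epsilon after t steps,
% of which a fraction a are accepted:
d(t) \;\approx\; \epsilon \sqrt{a\, t}
% demanding d(t) = L (one traversal of the longest length scale) and solving for t:
t \;\approx\; \frac{1}{a} \left(\frac{L}{\epsilon}\right)^{2}
% i.e. the cost per effective sample scales like the squared "condition number" L / epsilon
```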
You can see this here. The plot on the left is really just another way of looking at that animation. In red you see simple Monte Carlo estimates for the means of this distribution converge to the correct value, which in this case is zero. Why? Just to make sure you understand: this is a Gaussian distribution with mean zero, so it has two means, one for x₁ and one for x₂. The red points are drawn i.i.d. from the Gaussian, which I can do because it's a Gaussian, and if you average them up you get the two red lines at the bottom, which converge like one over the square root of n; I'm also drawing two standard deviations, for this standard Gaussian with correlation. In blue you see the same plot for the Metropolis-Hastings estimates, and you see that they're still really bad. And a challenge here is: if you were just looking at the histograms, at the statistics of these blue dots, without knowing the shape of the distribution and without the red dots as reference, it's not at all obvious how you would notice that this is happening. This week, your homework is to implement a Markov chain Monte Carlo algorithm, a very specific one for a very specific problem, as you may find out, and one of the things I hope you'll try is to convince yourself of whether you have actually drawn i.i.d. samples; we'll talk about that in the tutorial next week. It's not going to be straightforward, because you'll have a distribution to draw from that is totally unintuitive. Here it's simple, but keep it in mind when you're doing the homework.

Okay. So now we have one algorithm that works, kind of, but also kind of doesn't, and we're at 1954 by now. Since then, a large number of smart people have invested their brains into making these algorithms work well, and they've made several really great advances that make them work much better, which actually mirror, to some degree, design patterns you've already seen in the simulation lectures. So even though for many of you Monte Carlo is a repeat of something you've heard before, maybe try to find the connections to the other parts of this lecture course that we've been through. So: who knows, what is your favorite Markov chain Monte Carlo algorithm to use in practice? What's the Adam of MCMC? Uh-huh. It's a bit less obvious than for optimization.
I'll wait for a moment. Yes: Hamiltonian Monte Carlo, or NUTS, is the answer, and NUTS is a variant of Hamiltonian Monte Carlo. Here is Hamiltonian Monte Carlo; actually, I'll leave that out. Who has heard of Hamiltonian Monte Carlo? Not that many people. Okay, and those of you who raised your hands: do you think you know how it works? It's always like this. Okay, so we're going to talk about Hamiltonian Monte Carlo, (a) because it's the Adam of Markov chain Monte Carlo, and (b) because it uses an ODE solver, and we spoke about ODE solvers in the past few weeks, so let's see how they connect to each other.

Here's the idea. We're trying to draw from some probability distribution p(x). Let's imagine that this p(x) can be written as what physicists would call a Boltzmann distribution: the exponential of minus some function E(x). This is an extremely minimal assumption; pretty much all even slightly relevant probability distributions are of that form, in particular basically all of physics. In machine learning, we call these models energy-based models, and there's beautiful theory about how pretty much every density is of that form, and so on. So that's the problem we're trying to draw from, and we have access to E(x).

Now, what we're going to do is double the number of variables we draw; that's like in the ODE filters that you've seen. We take the state space of the Markov chain, which consists of all the x's, and for every x we invent a second state, which we call the momentum p; that's just a fancy word for the time derivative of x, for ẋ. And now we have to say how that new part of the state space evolves relative to x, because we just invented it; it has no further description yet, it's just there. What we're going to say is that it evolves according to a function that defines an ordinary differential equation. Let me first write down the function, and then talk about the ordinary differential equation. We come up with some function called H, after this guy, William Rowan Hamilton, which has the important property that it contains the energy E we used before, which depends on x, and then adds a new function, which we call K for kinetic energy, which depends only on the other type of variable, on p. The typical choice, though it's actually not the only one, is this: one half times, essentially, the square of p. For your high-school physics intuition, think of ½mv², kinetic energy, with the mass m just set to one because that's easy. And if you've already heard about all this and you're bored: branch off from the lecture at this point and think about how you'd do this with relativistic energies. You'd get relativistic Hamiltonian Monte Carlo, and that's a thing you can try. Just imagine what happens to the mass, and think about why that might or might not be a good thing to do.
But for those of you who haven't heard of this algorithm: don't go there yet, keep it in your head. Instead, we just say we're going to use this function; you just put it in there, and the important thing is mostly that K depends only on p and not on x. Now we define a differential equation that uses derivatives of this function H, and this is an important differential equation that you don't get to change, at least not easily. We say that the time derivative of the variable x is given by the partial derivative of H with respect to p. What is this? Well, E doesn't depend on p, K does, and if we take the derivative of K with respect to p, we just get p out. Simple. And the time derivative of p is given by minus the partial derivative of H with respect to x. What is that? Well, it's on the next slide, but you can already think about what it might be; just look at this expression and think about the derivative with respect to x.

Why is this an interesting thing to do? The first observation is: if we manage to draw from the distribution over x and p defined by e to the minus this function H, then, because exponentials work the way they do, this distribution factorizes into a term that depends only on x and a term that depends only on p, and the bit that depends on x is exactly the p(x) we're trying to draw from. So if this algorithm manages to draw from the joint distribution over x and p, it marginally draws from the distribution over x. If we watch it move around in this state space of x and p, and then drop all the p's and only look at the x's, we get draws from the actual target. That's the first good property. The second good property comes a few slides later, because first we need to think about how you actually implement this algorithm; for that, I have to jump around a little in the slides. We need to solve this differential equation, right? In a normal lecture on probabilistic machine learning, I would now say: for that you need an ODE solver, and there's this magic thing called dopri5 that you can just call in scipy, and it does it for you. Or you could implement one yourself; in physics, the algorithms used for this are called leapfrog integrators, which happen to be related to Heun's method. But you've had this course's ODE lectures, so we can talk about how we would actually implement this. Well, here we are. Here are our equations again from above. This is our choice of H; we've already decided that the ODE for x's time derivative is this, which is p, and I asked you to think about the other one yourself: the time derivative of p, which is given by minus the partial derivative of H with respect to x, is minus the gradient of E with respect to x.
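Collecting all of that in one place (my transcription of the slide's content):

```latex
% energy-based target and augmented Hamiltonian:
p(x) \propto e^{-E(x)}, \qquad
H(x, p) = E(x) + K(p), \qquad
K(p) = \tfrac{1}{2}\, p^{\top} p
% Hamilton's equations of motion:
\dot{x} = \frac{\partial H}{\partial p} = p,
\qquad
\dot{p} = -\frac{\partial H}{\partial x} = -\nabla_x E(x)
% the joint factorizes, so the x-marginal is the target:
e^{-H(x,p)} = e^{-E(x)}\, e^{-K(p)}
```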
So now we have a differential equation that we could feed into an ODE solver, but we can also just implement an ODE solver ourselves. Here's the ordinary differential equation, with this state space and this right-hand side, and now we're going to use Heun's method, which you heard about in Nathanael's lecture as a Runge-Kutta method of second order. What it does, and there's a Butcher tableau hidden in this expression, is: we start at point i, first take h/2 times f evaluated at the initial point, that's the first line of the Butcher tableau, then h/2 times f evaluated at z_i plus an Euler step, basically, and add both; in other words, we step with the average of those two slopes. This is, as it happens, a Runge-Kutta method with convergence rate of order two, and you can implement it. So this is the abstract expression for the Runge-Kutta method, and this is how you would implement the Runge-Kutta method in code. But in this case we know what f is, because it's a specific ordinary differential equation, the one at the top, so we can just plug it in and write down the Heun step in one go. This is the difference between using a software library and actually implementing things yourself by hand. If you do that, you end up with a bunch of Python code that looks like this; you can look at it after the lecture.
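Here's a sketch of what that bunch of Python code might look like: one HMC transition with the Heun step written out by hand. The name grad_E, the step count, and the step size are assumptions on my part, and note that a practical implementation would add a Metropolis accept/reject step to correct for the integrator's discretization error:

```python
import numpy as np

def hmc_step(x, grad_E, n_steps=37, dt=0.1, rng=None):
    """One Hamiltonian Monte Carlo transition: re-sample the momentum,
    then integrate Hamilton's equations with Heun's method."""
    rng = rng or np.random.default_rng()
    p = rng.standard_normal(x.size)   # fresh momentum picks a new level set of H

    def f(z):
        # right-hand side for z = (x, p): dx/dt = p, dp/dt = -grad E(x)
        xz, pz = np.split(z, 2)
        return np.concatenate([pz, -grad_E(xz)])

    z = np.concatenate([x, p])
    for _ in range(n_steps):
        k1 = f(z)                     # slope at the current point
        k2 = f(z + dt * k1)           # slope after a full Euler trial step
        z = z + dt / 2.0 * (k1 + k2)  # Heun: step with the averaged slope
    return np.split(z, 2)[0]          # keep x, drop the momentum

# usage: target N(0, I), i.e. E(x) = x.T x / 2 and grad E(x) = x
x = np.zeros(2)
for _ in range(100):
    x = hmc_step(x, grad_E=lambda x: x)
```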
But instead of staring at code, we can also go back to this nice visualization, reset it, and change to what's called Hamiltonian MC. And step: we're at this starting point again, and what we don't see, because it isn't plotted, is the momentum, which was initialized to a random variable at the initial point. Now, this is what the ODE solver did: it simulated the dynamics of this Hamiltonian system, which is basically a particle of mass one moving in a potential whose height is given by the negative logarithm of this shape. Wow. It moved around really a lot and ended up over there. And now we reshuffle the momentum, set it to a random variable again, run the ODE solver again, whoop, and we're at the other point. That's kind of neat, right? Just two steps, and we move basically from one end of the distribution to the other. That's nice. And we move again, and we accept. Maybe I'll just let this run for a bit so you can see it move around and produce these nice samples.

Now, we notice that this looks really cool; it moves very fast. But notice that every one of these black dots in these snakes requires one evaluation of the log of p, so we should really compare the efficiency of this algorithm to a Metropolis-Hastings run that makes as many evaluations as we have black dots in here. It's actually still better, for a reason we're going to see in a moment. But maybe the thing to notice in this animation is that we use a somewhat constant number of steps per simulation; actually we don't need it to be constant, we just set it to a constant number of steps, which is set here, to 37. And here is the step size, this 0.1; that's the step size h of our ODE solver. If you make it big, the algorithm makes big chunky steps, and you can already see that this is somewhat inefficient, because it moves, then turns around and comes back; if you make it really big, it goes really crazy; and if you make it tiny, it's super precise, but why would you need that much precision, right?

Okay, I'll stop this. Now, why is this a good thing to do? There are two interesting observations here, and they're already given away by this slide. The first is that this is a Metropolis-Hastings scheme that always accepts. The second is that this algorithm makes moves whose cost per effective sample is not the ratio of large to small length scale squared, but that ratio without the square; we basically get rid of the random-walk problem and get an exponent of one instead of two. We'll see why in a moment, or we'll see it here. First: if the algorithm takes small steps, think of a local region where the distribution is nearly a constant function, then it moves the way a particle with momentum moves: in a basically straight line that doesn't bend, because if no forces act on a particle, it moves in a straight line; Newton's second law, F = ma, I guess. So the distance traveled after n steps is given by the number of steps times the step size of the solver, a constant times the number of steps, instead of the square root of it. We move further, and we saw this in the animation.

The second thing is that this algorithm always accepts, and we'll see why. Basically, the point is that the Hamiltonian is conserved. The way we chose this ordinary differential equation was a hidden way of saying that the time derivative of this function H(x, p) is zero. And because that time derivative is zero, the joint density, wherever we evaluate it along the trajectory, always has the same value as at the point where we started, so the acceptance probability here is always one, and the algorithm always accepts; there's effectively no acceptance step anymore. You can also write this down as math: the way we defined this ODE means that the total time derivative of H, by the chain rule, with respect to the variable x and the variable p, vanishes. We plug in the two equations of motion from up here, notice that the terms cancel out, and we're at zero. So this algorithm draws from the joint distribution over x and p along one constant level line of H, essentially.
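That cancellation, written out (chain rule, then plug in the equations of motion):

```latex
\frac{\mathrm{d}H}{\mathrm{d}t}
  \;=\; \frac{\partial H}{\partial x}\,\dot{x} \;+\; \frac{\partial H}{\partial p}\,\dot{p}
  \;=\; \frac{\partial H}{\partial x}\,\frac{\partial H}{\partial p}
   \;-\; \frac{\partial H}{\partial p}\,\frac{\partial H}{\partial x}
  \;=\; 0
```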
And which level line are we on? Sanity check, to see whether you're following: does this mean that this algorithm somehow magically lands on one level line and then always stays on it? No: the level line is chosen by the initial momentum. At every time step we randomize the momentum, which puts us on some level line of H; then we run a simulation, which gets accepted, because we stay on the same level line of H; and then we randomize again, moving to a different level line by changing the momentum. So in a way we're moving in H, in x-space and in p-space; we just don't really see the p, because we re-randomize it at every step. And of course, when we randomize the momentum p after the simulation step, we also change which level line we're on at that x, which is why we see the chain moving between different level lines.

Okay. That's largely what I wanted to say about these algorithms. But actually, maybe the best way to understand them is just to look at this plot for a bit and play around with the parameters. Notice how this is a bit like optimization; it's a bit like Adam, which has the parameters learning rate, ε, β₁, β₂, and if you've done deep learning, you know you have to set those parameters, and that's annoying. This algorithm also has two parameters. In this lingo they're called leapfrog steps, the number of steps the ODE solver is allowed to take, and ΔT, the step size h of our integrator. We have to set those somehow, and if we set them badly, the algorithm doesn't work well. If we set the number of steps to a very large value, you can see it move out, turn around, and come back, and that's annoying, because it wastes computational resources: we start the simulation, and it spends most of the time going back and forth without actually moving, so the overall distance traveled is much less than the distance the simulation covered during the run. You want the number of steps, and also this ΔT, chosen such that the simulation starts, moves in some direction, and when it starts turning around, you stop. That's the intuition you would have, and that turning around is what's called a U-turn; that's where the no-U-turn sampler gets its name.

So you could come up with an algorithm that notices the U-turn. Maybe try this for yourself: how would you write a subroutine of this algorithm that notices when the trajectory has turned around? I'll let you think about this for ten seconds. First of all, you may have noticed that it's already a bit tricky to think about, because the space might not be two-dimensional; it might be very high-dimensional, so if you're drawing from a thousand-dimensional distribution, it's a bit harder to even say what it means to turn around. But you could, for example, measure the Euclidean distance to the point where you started, watch it as a summary statistic go up, and when it starts dropping again, you're done. The problem is that when you do that, you break detailed balance: you're building an algorithm that is guaranteed to move away and not come back. Yes? The momentum? Ah, you mean you move up the hill, and when the trajectory turns around, the momentum goes through zero; yes, and that again, I think, would break detailed balance.
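For reference, the criterion that the no-U-turn sampler, coming up next, ends up using is not a distance statistic but a dot product: stop when the two ends of the trajectory start moving toward each other. A hypothetical sketch in that spirit (following Hoffman and Gelman, 2014):

```python
import numpy as np

def u_turn(x_minus, x_plus, p_minus, p_plus):
    """True when the trajectory endpoints (x-, x+) with momenta (p-, p+)
    have started to approach each other."""
    dx = x_plus - x_minus
    return (dx @ p_minus) < 0 or (dx @ p_plus) < 0
```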
Yes? The momentum? Ah, you mean you move up the hill, and when the trajectory turns around, the momentum has dropped to zero. Yes, and that again, I think, would break detailed balance. So one way to do it, the one that many people use, is called the No-U-Turn Sampler, NUTS, which is best explained by just looking at a visualization. What this algorithm does is it actually starts two Markov chains in opposite directions and waits until one of them starts turning around, that is, until it starts getting closer to the other one that is running away, and then it draws a random number to decide which of them to use. It somehow turns out, and that's about a page in the paper, that this actually satisfies detailed balance. Quite often, when Markov chain Monte Carlo methods are defined, this is what the entire game comes down to: someone has a nice intuition for what should work, then they realize that it doesn't quite keep detailed balance, so they introduce more randomness to make it satisfy detailed balance again. Pretty much any algorithm in the Markov chain Monte Carlo or Metropolis-Hastings mold comes down to this. Also, pretty much half of these algorithms were invented by one guy, Radford Neal. This one wasn't, but the one you're going to implement in your homework actually is: it's a variant of slice sampling, which comes from Radford Neal.

Okay, so this algorithm is maybe an example of one of these numerical routines, analogous to what we've heard previously for ODEs, for partial differential equations, for all of linear algebra and so on, where at some point things just get so tricky that it's good that someone did a PhD on it, worked it out, wrote it down, packaged it, and left it to you. This is the kind of thing you might not want to implement yourself anymore. But it's still useful to understand what it does and why it does certain things in a certain way, from the role of the randomness and the behavior of Markov chain Monte Carlo to why things are chosen as they are, because now you can at least understand what that step size actually does. So if someone tells you that you have to work on it, maybe you can think about how you would set it. By the way, one way to think about this step size is to let the second derivative of the target give you a local scale for how far you're going to step; this is similar to how optimization methods use curvature.

Good point; I actually don't know, so look it up, just read the original paper if you want to. There is a way; actually, in this visualization you can check various implementations of this algorithm, and this one somehow adapts certain things. Another way of adapting the hyperparameters of Hamiltonian Monte Carlo is not to use these two simulations in opposite directions, but instead to use Hessian evaluations, so curvatures, to get a sense of the local scale of the distribution, and then decide how far you're going to step. You can see this in the simulation here. There are various other versions, but at this point, just as there are many different ODE solvers, you don't have to know all of them. It's just good to know at least one decent one.
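In case you're still wondering what the U-turn check from a moment ago actually computes: in the original NUTS paper it is just a dot-product condition on the two endpoints of the trajectory being grown in both directions. A sketch, with variable names that are mine:

```python
import numpy as np

def u_turn(x_minus, p_minus, x_plus, p_plus):
    """Has the doubled trajectory started to fold back on itself?

    (x_minus, p_minus) and (x_plus, p_plus) are position and momentum
    at the backward and forward ends of the simulated trajectory.
    """
    dx = x_plus - x_minus
    # If the momentum at either end points against the line connecting
    # the endpoints, further simulation would shrink their distance.
    return (dx @ p_minus < 0) or (dx @ p_plus < 0)
```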
Okay, so that was the textbook version of the Markov chain Monte Carlo lecture. If you've heard it before, you've been reminded; if you haven't heard it before, you've had it now. This is the knee-jerk approach to computing expected values in Bayesian inference if you really don't know what else to do. If someone gives you a probabilistic model, a probabilistic program, so a bunch of variables depending on each other, where some of the variables are initialized as random variables, and you want to compute posteriors, you apply Markov chain Monte Carlo to draw asymptotically from this joint distribution, and then you compute expected values of that distribution by basically computing sums of function evaluations.

The arguments I've given you in favor of these algorithms are the ones you usually hear in lectures on this topic. But there are also reasons why these algorithms are maybe not the best thing to do, and we've already come across a few of them. The first was that the convergence rate seems kind of low. Even for simple Monte Carlo, that is, even if we can draw random numbers from p independently of each other, which we can't do with Markov chain Monte Carlo, which only approximates independent draws, even then we only get a convergence rate of one over the square root of the number of samples. And the statisticians, as I said at the beginning, would tell us that this is the best we can hope for, because for unbiased estimators there is no better rate than the square root of the number of samples.

So if we want to get out of this, we have to question whether we even need an unbiased estimator. Here it's maybe good to take a step back and think from the perspective of a lecture on numerical algorithms for machine learning: why is there a random number in this computation in the first place? Unbiasedness is a property of random variables; it's not a property of integrals. The integral is not unbiased; the integral is just a number. But Monte Carlo methods use random variables. Why is there a random variable in our computation to begin with? When you think about it, there is really no reason. The only reason is that it makes our estimator unbiased, and we can only talk about unbiasedness once there is a random variable, because it's a property of random variables. So there's a weird philosophical jump here: I'm going to use a random number, so what's the best number I can compute? Well, an unbiased one; unbiasedness is one property of good estimates. But why are you computing an estimate in the first place? Maybe there's another way of approximating the quantity we care about that has nothing to do with randomness. And here we may notice that what I showed you today, in the entire lecture, didn't contain a single actual random number at all. Because what actually is a random number? After my entire career, to be honest, I still don't know. Whenever I start thinking about random numbers, I realize that all the numbers I can think of are maybe not random.
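Before we get to those, the square-root rate from above is easy to see empirically. A throwaway sketch, with the target E[x²] = 1 for a standard Gaussian chosen purely for illustration:

```python
import numpy as np

# Empirical check of the O(1/sqrt(N)) Monte Carlo rate: estimating
# E[x^2] = 1 for x ~ N(0, 1) from N independent samples.
rng = np.random.default_rng(0)
for n in [10**2, 10**4, 10**6]:
    x = rng.standard_normal(n)
    print(n, abs(np.mean(x**2) - 1.0))
# The error shrinks roughly tenfold for every hundredfold increase in
# samples: each extra digit of accuracy costs a hundred times the work.
```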
So let's think about a few random numbers. I wrote down a few. Please don't check the slides if you've downloaded them, because I'm going to give away the answers. Here are random numbers, five of them. Which of them is random? Maybe first of all I should have asked: is this number random? And that question already gets you to the point: randomness is a property of sequences, right? A single digit cannot be random, because you don't know what the corresponding structure is. You have to look at sequences. But sequences computed on a Turing machine are always finite, so here are a bunch of finite sequences.

Let's see: do you think this first one is random? You've all seen the pattern, right? There's a bunch of repetitions. What I actually did here is I took dice, which are like your philosophical device for making randomness, threw them, and then just wrote each outcome down twice. That's it, nothing more. I thought about some cool way to produce repetitions of varying lengths, but it's really just whatever comes up, written down twice. Is that random? It's kind of random and not random at the same time, because if I give you the first digit of a pair, you can predict the second, but not the third. And if I dropped you at a random spot in that sequence, showed you the current number, and asked you what the next number is, what would your answer be? The same number, because with probability 50% you're at the beginning of a pair, so there's a high chance that the next one is the same. Okay, so that sequence is maybe random, because I used dice, and dice we have culturally accepted as random. But the doubling is kind of weird, right?

So what about the second line? Someone recognizes it; it's a very famous sequence. These are digits of pi, but not the first ones; they start at the 41st digit, because there's always someone in the room who knows the first 100 digits of pi, so I deliberately moved a bit further in. Is this a random sequence? You're shaking your head. Why not? If I dropped you at a random spot in pi, could you predict the next digit? No, because there's no structure to the sequence; it's an irrational number, there are no patterns. Interesting point: somehow pi is weirdly finite and infinite at the same time, because we only know a finite number of digits of pi, a lot of them, but only a finite number. Actually, I don't know whether there is an algorithm to jump to an arbitrary point in the digits of pi; there's probably no easy way to get there. Maybe there is, I'm not a number theorist, maybe there's some weird way of getting there. But clearly, this is about the most deterministic quantity ever, right?
It was studied by the ancient Greeks two and a half thousand years ago and found to be somehow absolutely fundamental, maybe one of the most important objects in our universe. So it's not random, right? But it has no structure. So maybe that's something we want from a random number sequence. Hmm.

Okay, the next line is actually just a safety net for me, for when someone does know digits of pi. Here are the 1,561st to the 1,590th digits of the golden ratio. It's exactly the same kind of thing, just a bit less well known, but just as reliable and just as deterministic.

The next line, I forgot, let me check. Oh yes, that's actually the kind of random number I've used all day: a number produced by a pseudo-random number generator, using a particular outdated method, the lagged Fibonacci method, with a particular seed. So this sequence of numbers is maybe random; it's random in the sense in which we use the phrase "random number", because those are exactly the kind of numbers we use in computations when we run Markov chain Monte Carlo. But notice: if I hadn't told you the seed, it would have seemed random, and the moment I tell you the seed, it's not random anymore. There's this private key I have, called my random seed, and once I reveal it to you, the algorithm no longer produces i.i.d. random draws.

And the last one I can actually tell you about as well. These are digits from a CD containing random numbers. Back in the day, when there was no internet and it was hard to get random numbers, there were actual machines that produced random numbers, physical ones, typically using radioactive decay to produce random bits. You could buy them, and there were different machines from different producers across the world. And there was a guy called George Marsaglia in the US who wrote a famous paper checking them for randomness. He bought a bunch of these machines, produced random bits, and checked whether they were random, using various kinds of tests, basically checks for patterns. For example, he turned the random bits into occurrences of birthdays in a calendar and then checked whether joint birthdays happened at the right rate according to the birthday paradox. He built twenty-odd of these tests; they were called the Diehard tests of randomness, which is what the paper was called. He found that none of these machines was actually properly random: they all failed some of those tests, even though they were physical and supposedly totally random. So he basically chained them together, computed the parity of their bits, and released the result on a CD. He would send those CDs to people who really needed random numbers for safety-critical applications. When I did my undergraduate degree, you could already download the CD; good that we had the internet by then. So I downloaded it, and here are a bunch of those bits. They are about as random as it gets; they come from the kind of source banks used to produce random bits. But they are stored on a CD, so once I tell you where they are, they're not random anymore.
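To give a flavor of the kind of pattern check Marsaglia's battery performs, here is a toy example, far simpler than anything in Diehard: the classic monobit test, which only asks whether a bit stream contains about as many ones as zeros. The code and names are mine, purely for illustration.

```python
import math
import numpy as np

def monobit_pvalue(bits):
    """p-value under the hypothesis that `bits` are fair coin flips."""
    n = len(bits)
    s = abs(2 * int(np.sum(bits)) - n)      # |#ones - #zeros|
    return math.erfc(s / math.sqrt(2 * n))  # tiny p-value => suspicious

rng = np.random.default_rng(123)
print(monobit_pvalue(rng.integers(0, 2, 10_000)))  # PRNG bits: passes

# A perfectly alternating sequence also passes this one test, which is
# exactly why a battery of twenty-odd different tests is needed, not one.
print(monobit_pvalue(np.tile([0, 1], 5_000)))      # p = 1.0, yet clearly
                                                   # not random at all
```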
What this is supposed to get across is that randomness really has something to do with lack of information: about patterns you don't know, about points in a deterministic sequence you don't know. At least that's the case for all the randomness we use in contemporary machine learning.

There are kinds of applications of random numbers where this lack of knowledge about where you are in a random sequence actually matters, and those are the ones where it's all about one party knowing something and another party not knowing it. So: cryptography, basically. Cryptographic hashes use this property, and for them you typically use a random number generator, but then the seed of the random number generator is the important bit; that's like your private key that you keep hidden away. But for the kind of random numbers we've used today, to compute integrals, it seems totally misguided to talk about this lack of information.

If I wanted to phrase what's actually going on when you use a Markov chain Monte Carlo method: imagine you're writing a paper about Bayesian inference on some model, maybe on some real-world data, and you want good estimates for things. So you write in your paper: I'm going to use Markov chain Monte Carlo. When you do that, what you're actually telling your reviewer is: I produced a bunch of numbers, but I'm not telling you how. I produced them in a certain way, but I'm not telling you which. Why? Because as long as I don't tell you, the estimate is unbiased. The moment I tell you the seed, it's not unbiased anymore. So if someone actually uses the same seed all the time and you know that seed, you kind of shouldn't accept the paper anymore, because the estimate is not unbiased. But if you don't know the seed and don't know how they set it, maybe it's okay.

Now, of course, there are all these nice studies. There was something on Twitter, back when it was still cool, a while ago: someone ran a big query over code on GitHub to check how people set random seeds, and found big spikes at particular values. People set their seeds to 0, to 42, to 69, to 420, whatever, and not to actual random numbers. They also set them to 123456789, or to whatever you get when you put your hands on the keyboard and go from left to right; you get these kinds of patterns.
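The seed point is easy to demonstrate; a two-line sketch using numpy's generator, with 42 as the obligatory example seed:

```python
import numpy as np

# Once the seed is public, the "random" stream is a fixed, fully
# deterministic sequence: everyone using seed 42 gets exactly these numbers.
a = np.random.default_rng(42).standard_normal(3)
b = np.random.default_rng(42).standard_normal(3)
print(a)
print(np.array_equal(a, b))  # True: there is nothing left to not know
```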
So when people use Markov chain Monte Carlo, all the arguments they essentially make for their computation come down to the fact that they are hiding information from you. And that's actually all of it: all the other arguments for Monte Carlo come down to this property of being unbiased. So when you're using Monte Carlo, you're accepting that you're paying this extremely high price, that the algorithms converge at best with a square-root rate, which seems really bad, and next week we'll find algorithms that converge polynomially or even exponentially fast. The only reason you're willing to pay this price is that there is this beautiful theorem at the beginning that says the resulting estimate is unbiased. But it's only unbiased because I'm using random numbers, right? I'm approximating something that is just a number; it has nothing to do with randomness. I'm completely arbitrarily injecting the idea of randomness into this problem, then making an argument about how the result is unbiased relative to the randomness I just injected, and that argument only holds up until I tell you how I made the randomness. Because the moment I reveal how my random number generator works, the argument is philosophically not valid anymore.

So here's the real reason why people want to use Markov chain Monte Carlo: it's easy to use. It's easy to implement, and you can just start it, walk away from the machine, and have the computer do the difficult job for you. You tend to just have to call a method that accesses the value of your probability distribution, your p(x) or your log p(x), or actually just p(x) up to a normalization constant, which is essentially the same as accessing the energy function, and then it just runs (there's a little sketch of this interface below). And it happens to work reasonably well in problems of relatively high dimensionality. I say this very carefully, in this way, because that's another thing: when you really check, when people use Markov chain Monte Carlo in high-dimensional settings, we rarely ever get to hear, or get to prove, that it works well. It's just that there is no other algorithm to compare against, to check whether it works well or not. So we end up relying on Markov chain Monte Carlo because it's the only thing for which we have theoretical guarantees, guarantees which rely on the fact that we're using random numbers, which we don't actually have, but maybe they are random enough to accept the mathematical argument. And because we don't know the convergence rate of our particular Markov chain Monte Carlo run, we kind of just have to trust that the samples are useful.

That's a little bit unfair, because there are ways of producing interesting summary statistics to check whether your Markov chain Monte Carlo estimate is mixing, and we'll talk about some of those in next week's tutorial. But they are not hard guarantees; they're just tools, like software development tools, basically sanity checks for whether the algorithm seems to be doing its thing well. So when we use Markov chain Monte Carlo, we are accepting a fundamentally very slow, painful, laborious algorithm just because it makes things easy for us and hard on the computer. And that's fine quite often, because human labor might be much more expensive than the electricity needed to run the computer. But for some settings, and we'll find more of them after Christmas, it's absolutely not a smart idea to use Markov chain Monte Carlo; it's much better to think about other ways of approximating integrals. What we will then notice is that those methods will always be approximations as well; it's just that the way in which they are approximate will be more structured. You could think of this as a form of bias: there will be certain ways in which they always tend to go wrong, which you can kind of guess, but not perfectly guess, because otherwise you could correct for them, and which are not so easily found in Markov chain Monte Carlo. But that really has more to do with the fact that MCMC is so opaque than with these other methods being bad.
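The sketch referred to above: the entire interface a basic sampler needs is one function returning the unnormalized log density. Here is a minimal random-walk Metropolis version; the proposal, step size, and names are illustrative choices of mine, not a recommendation.

```python
import numpy as np

def metropolis(log_p_tilde, x0, n_samples, step=0.5, seed=None):
    """Random-walk Metropolis; needs only the unnormalized log density."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    out = []
    for _ in range(n_samples):
        prop = x + step * rng.standard_normal(x.shape)  # symmetric proposal
        # Normalization constants cancel in the ratio p(prop) / p(x).
        if np.log(rng.random()) < log_p_tilde(prop) - log_p_tilde(x):
            x = prop
        out.append(x)
    return np.array(out)

# Usage: any energy function works, e.g. an unnormalized Gaussian.
chain = metropolis(lambda x: -0.5 * x @ x, np.zeros(2), 5000, seed=0)
```

Note how the unknown normalization constant cancels in the acceptance ratio; that cancellation is the whole reason this family of methods only needs the energy function.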
So: Markov chain Monte Carlo methods are only correct if you run them forever, and nothing ever runs forever. And they are not actually random, so they are not even correct in that sense, because no one actually uses random numbers. Case in point: if you've used Stable Diffusion recently to make some beautiful pictures, like I do all the time now for my slides, of course you set the random seed to a fixed value, because you want to get the same picture back the next time you run the algorithm. And there are all these websites of people posting their prompts for these algorithms together with the random seed that you have to set to get the image you want to see. So what happens on those websites is people saying: run this random algorithm in this deterministic configuration to produce this thing I found. I've been rummaging around in this sea of randomness, and if you fix the randomness in a certain way, so that it's not random anymore, it does this interesting thing. I'm not sure that's a property we want to keep in machine learning for the long-term future.

So with that, I'm at the end for today. Next week, we're going to approach the integration problem from the other direction. We'll look at how fast an integral could possibly be computed, and what that has to do with inference and with uncertainty quantification. The price we'll have to pay, I can already tell you now, is that the algorithms we'll look at will typically only work for low-dimensional problems, and in high dimensions, if you're lucky, they are just as good as Markov chain Monte Carlo. So they won't actually break; they'll just be a little bit like Markov chain Monte Carlo, and then we might as well use Markov chain Monte Carlo, because it's so easy to use. Okay, thank you very much.