Welcome back to Probabilistic Machine Learning, lecture number four. Here's where we are in the course. In the previous three lectures we found, in lecture one, that probabilities provide a mathematical formalism to reason about quantities that aren't fully identified by data, and that this leads to the rules of probabilistic reasoning: the sum rule, the product rule, and Bayes' theorem. In lecture two, we then discovered that one of the pain points of this process is that it can be computationally very hard, because we have to track a potentially combinatorially large number of hypotheses, and we discovered the notion of conditional independence as a partial remedy for this problem, which allows us to separate parts of the problem during the reasoning process. Last lecture, number three, we extended the notion of probabilities from discrete spaces to continuous domains and saw that there things are actually almost the same, as long as we are able, and this is not always the case, to define an object called the probability density function, which distributes truth over an input domain. PDFs follow essentially the same rules of probability: sum rule, product rule, Bayes' theorem. The only thing we have to be careful about is that if we change the definition of the input domain, or if we map from one input domain to another, to another random variable, then we have to be careful about the transformation.

Today I'd like to talk with you a little bit about computation. How do we actually implement probabilistic models on a computer? Of course, this is going to be a topic for the entire term, but it's now time to get a little bit into it. So what are the kinds of computations that we actually face when we do probabilistic machine learning? In other interpretations of machine learning, which we will also talk about over this course, and which you can also follow along in the parallel lecture on statistical machine learning by my colleague Ulrike von Luxburg, the core computation you typically encounter is an optimization problem: finding the best hypothesis within a space of hypotheses, the one that explains the data best. In probabilistic reasoning, because we have to keep track not just of one hypothesis but of the entire set of possible explanations, at least in principle, the corresponding computation is not optimization but integration: finding the volume of the region we care about in the space of hypotheses, or rather under the probability distribution over that space.

So what do these integrals look like? We already encountered one last time, in lecture three. Just to remind you briefly: this was the example where I tried to infer the proportion of people in the population who wear glasses, the probability for someone to wear glasses. To do that, we introduced a generative model that had this probability pi as the latent variable; we could observe individuals x_i, and we assumed that each of these observations was conditionally independent, so that's a use of conditional independence. We then assigned actual values to the likelihood, this Bernoulli probability for each individual to wear glasses or not, and we wanted to do Bayesian inference. Bayesian inference just uses Bayes' theorem: the posterior is equal to the prior times the likelihood divided by the evidence, and the evidence is one of these integrals.
It's the integral, over all possible values of the latent quantity pi, of the probability of the data given pi. This integral actually already came up in Laplace's original argument when he introduced this kind of notion, and he didn't really know how to solve it, so he used an approximation for it. He already faced this computational issue of having to compute something, the volume of a hypothesis space, that is not easily written down. Now, this kind of integration problem will show up all over the place. It's the bane of probabilistic reasoning, and we will have to find lots and lots of tools to deal with the complicated integrals that show up.

What kind of integrals are these? A very useful notion to briefly define here is something that you've surely seen before, and if you haven't, don't worry, we're just going to do it in three minutes now: the notion of an expectation. I'm going to use two different kinds of notation, because I'm sometimes a little bit sloppy and I might swap back and forth between them. I'll write either E_p[f] or ⟨f⟩_p to express what I want to call the expectation of a function f under the probability distribution p, meaning the integral ∫ f(x) p(x) dx. We will typically assume that p has a density, and we consider some function f, and we care about this kind of integral. This is a general form of which you've certainly seen many examples. For example, f could be just the identity function f(x) = x; then this is known as the mean of the distribution. Of course you've seen that before, and if you haven't, now you have. Another interesting quantity is the variance of a distribution: the expectation of the function (x − E_p[x])². Here I'm already being somewhat sloppy with the notation: by x inside the expectation I mean the function that returns x, so exactly the mean from up here.

This quantity, let me actually write something on the board, can also be rewritten, and many of you will have seen this before. It's useful to remind you, though, that this quantity, which is called the variance, can also be written as follows. We care about var_p(x) = E_p[(x − E_p[x])²] = ∫ (x² − 2 x E_p[x] + E_p[x]²) p(x) dx, where I've used the binomial formula to expand the square (and there was a two missing on the board). Now, the kind of argument that we will typically make, and if you're used to integrals then this is very straightforward for you, and if you have to get used to integrals again, because maybe your education was more focused on discrete math, then you will quickly get used to it: think of an integral essentially as a sum, and then check where the quantities we care about show up. We can now take this integral apart into three terms. The first is an expectation over x², which I'll also sometimes write as ∫ x² dp(x); given that there is a density, this is the same as ∫ x² p(x) dx. Then there is the term −2 ∫ x E_p[x] p(x) dx; notice that E_p[x] is not a function of x, it's just an expectation, itself an integral over a different x, so we can take it out of the integral. What's left is another integral ∫ x p(x) dx, so this middle term is just −2 E_p[x]², the square of the mean, because that's exactly the definition of this quantity. And now we have plus the same thing squared again, +E_p[x]².
Let me put it like this, without too many notational shortcuts: the last two terms differ only by the factor, −2 E_p[x]² + E_p[x]², so one of them cancels, and we're left with the expectation of x² under p minus the square of the expectation of x under p: var_p(x) = E_p[x²] − E_p[x]². So this is another expression for the variance: either it's the expected squared distance to the mean, or it's the difference between the expected square and the squared mean. We will actually use this in a computation in a moment. Mean and variance are the first two, the first and the second, moments of a distribution. Moment is another word for these kinds of quantities, expectations of polynomials. E_p[x²] is called a non-central moment; the variance is called a central moment, because we subtract the previous moment. All of this will just come up along the way; it's not particularly important. There are other quantities people care about, like for example the expected value of the negative logarithm, E_p[−log p(x)]. This is called the entropy; at this moment we don't have to talk about why and what this is. We will mention it when it comes up. The reason these quantities are important is that they provide a geometric interpretation, a structural kind of estimate, of the shape of the probability distribution. And by the way, the evidence up here is itself one of these expectations: it's an expectation, under the prior for pi, of the likelihood function. So it's a kind of expected likelihood, an expectation of how well the model explains the data that we actually got to see.

We will need these kinds of quantities to do all sorts of interesting things, and that means one of our key challenges on a computer will be to solve these kinds of integrals. Of course you could hope that, if you're really careful, these probability distributions p and these functions f are such that we can always solve the integrals analytically. Maybe you have a big book of integrals lying around on your bookshelf somewhere, but that's not always going to be the case. In fact, maybe very early probabilistic reasoning worked under this kind of premise, but even Laplace already struggled over 200 years ago with a relatively simple integral like this one, and it took really hard work by Leonhard Euler to come up with these kinds of answers. To solve this problem, we need general algorithms that do this for us. Today we will encounter one of them, and I will add it to a slide that I'm going to show you over and over again. This slide will be a feature over the entire course, until the very end, and I will use it to plug the most important tools we're going to use into one joint picture. I want you to think of this slide as the toolbox of probabilistic machine learning. If you are a master craftsman, you might have one of these wonderful toolboxes that contain everything you need to do your job. When you open it up, there are different compartments for different tools. Maybe there's a big hammer on top that you always need, and then in the bottom drawer there are more complicated, fine, precise tools for very specific tasks, tools that really simplify one task and are basically useless for everything else. On this slide I'm trying to construct something like this, because if you want to be a master inference engineer, a master machine learning engineer, then you also need this kind of toolbox. And this slide has a structure.
At the very top, there is the hammer that you need for everything: the framework under which we will do probabilistic reasoning. Actually, after the first three lectures, this first part of our toolbox is already finished. It just consists of the laws of probability. To get rid of one variable in your reasoning problem, you integrate it out (the sum rule). To reason about one variable given another, you use the conditional distribution (the product rule), which directly leads to Bayes' theorem. With these three, we have everything we need to think about what we actually have to do to solve an inference problem. But that's just the abstract notion. When we face a concrete problem with data and latent quantities, we have to do two things: we have to build models, generative models, and then we have to actually solve the computational problem. So on the left-hand side I will collect tools for building models, and on the right-hand side I will collect algorithms that allow us to reason in these models. The separation between the two is not perfect; there will sometimes be a concept that doesn't fit cleanly on one side or the other, but it's reasonably good. And I've already entered one kind of tool that we discovered in lecture two: directed graphical models, which provide a way to build models and reason about conditional independence in them directly by drawing them, and which are also visually very helpful for thinking about the structure of your problem.

Today we'll find our first computational tool, and it's going to be called the Monte Carlo method. You've probably heard about it, so let me tell you about it. It goes back to a very simple idea: if you want to integrate a complicated function, or compute an expected value, you can do so by drawing random numbers that have the distribution defined by p(x). We will talk at the end of this lecture more about what this actually means, but at this point you probably have an intuition for it: random numbers which show up with a frequency such that their density approaches, as you get more and more of them, exactly this density function. To compute the integral, you approximate it by drawing such random numbers, so we get individual samples x_1, x_2, x_3, x_4 and so on, all drawn independently of each other such that their density is given by p. Then you approximate the integral by evaluating, for every single sample, the function we care about, summing up all these values, and dividing by the number of samples. Algorithms of this form are called Monte Carlo methods.

The name goes back to this place, the famous casino in Monte Carlo, in Monaco on the French Riviera, where the father of one of the inventors of the Monte Carlo method apparently used to gamble, a fancy place where you can lose a lot of money. Why that name? Because these algorithms can be thought of as playing randomized games and observing the structure or frequencies that show up in them, which might actually be how the method was originally invented. The names of three people are connected to these kinds of algorithms. This is Stanisław Ulam, a Polish mathematician who emigrated to the US on the eve of the Second World War. This is Nicholas Metropolis, a Greek-American mathematician who also lived in the States.
And of course, the famous John von Neumann, one of the most famous computer scientists of all time, a Hungarian mathematician who also emigrated to the US. So: three Europeans holed up in the US during the Second World War, working on nuclear weapons in the Manhattan Project. Apparently the story goes that Ulam was recovering from an illness in hospital. He had to spend some time there, was playing solitaire games with himself, and was wondering how frequently these solitaire games are actually solvable: what's the probability for one of these solitaire deals to be solvable? He discovered it was quite hard to work this out mathematically, because it's an expectation under a probability distribution that is relatively complicated, and then realized that instead of trying to do the computation, he could also just play, say, a hundred of these games, and the frequency with which he could solve them would already give him an intuition and a first estimate for how frequently this success actually happens. Of course you can do that with any other game of chance as well. And then, potentially because of this, Metropolis suggested the name Monte Carlo methods, to give the whole notion a somewhat grander name.

One of the first ways these algorithms were used concretely, in a real, let's call it scientific, problem, if you like, was apparently to compute densities of neutrons in geometries for nuclear weapons. One of the biggest problems, maybe the biggest, in designing a nuclear weapon is that you need to come up with a geometry for how to distribute the reactants inside the chamber of the device such that, as built, the density of neutrons doesn't become too high, so there's no chain reaction, and then, when you compress it during the explosion, the density of neutrons rises such that there's a massive chain reaction and the whole thing blows up. Building these devices this way was apparently not so straightforward, especially if you don't have a computer at hand. So here is a device that was actually built by Enrico Fermi, the physics Nobel laureate, based perhaps on a conversation with Ulam; here he is again, now without the funny Second World War attire. This is called the FERMIAC, the Fermi analog computer. It's a little device that you can drag over a piece of paper to construct paths of neutrons in a nuclear weapon. The way this actually works, and I think it makes a funny little historical side anecdote, is that you draw random numbers. How do you draw random numbers? Well, thankfully you're working with John von Neumann, and he comes up with a very crude but efficient way of constructing random numbers mechanically on a computer; look up the von Neumann method (the middle-square method) if you'd like to know how it works. Then you start at a random place, imagine you're a neutron, and use random numbers to compute probabilities for how long, in a particular material, this neutron is going to travel before it hits another particle. These numbers were actually known; they had been measured by chemists and physicists. So depending on which material you're in, you set one of these wheels to the correct density, and this wheel turns as you drag the cart along, and you measure what the effective distance traveled in this material is.
It also depends on whether you're a fast or a slow neutron; for that there is a binary switch on this one. And there's a little pen mounted underneath that draws a line while the device moves forward, until the counter you see up here reaches the random number you've just drawn; then you draw another random number and a new angle, turn the device to the correct angle, and continue drawing. If you keep doing this, then over time you get some kind of density of lines, which was the point of this exercise: you get a feeling for what the distribution of neutrons is going to be in this geometry. You look at that, you get a feeling for what's wrong, you redo the geometry, and then you have to do this whole complicated business again. Over time you might then learn how to build a nuclear weapon, and in fact that's apparently what happened, because one of the first successful designs for nuclear weapons was called the Teller-Ulam design, so this guy clearly had an interesting influence on this kind of development.

So why is this a good way to solve computational problems? Now, I'm not sure it was a good thing that nuclear weapons were developed, but clearly it solved a computational issue; maybe we can think about this a little bit more mathematically as well. Before that, I want to give you one more example of how to actually do this. So far I've waved my hands about it a little and said: here is how you use random numbers to do some computations, and here is what the estimate looks like. Now let's see what I actually mean by this and how to do it in practice. Let's say, and this is one of the typical teaching examples for Monte Carlo methods, we wanted to compute the number pi. Of course we know what pi is, it's 3.141 and so on, and some of you will know the first 100 digits of that number, but let's say you don't and you want to estimate it, because it's a complicated thing, right? It's an irrational number. One way to do so without an analytic computation is to notice the following. The area of the unit circle is pi, so here, for this quarter of the unit circle with radius 1, the area under the quarter circle is pi over 4. Now suppose we have access to random numbers that are uniformly distributed over this unit square; each dot or circle in this plot is such a random number, drawn independently of the others. Then the ratio between the number of samples that fall inside the quarter circle and the total number of samples approaches pi over 4, because the area under the quarter circle is pi over 4 and the area of the unit square is just 1. That means we can estimate this fraction, and therefore the number pi, by drawing such random numbers, checking which of them lie inside and which lie outside of the circle, computing the ratio of inside to all, and multiplying by 4. So how do we actually do this in practice? I've copied the code onto the slide for you to look at later, but let me actually show you the corresponding notebook; I've already run it once. Here is a simple, maybe the simplest, piece of code we'll look at all summer: a very, very simple NumPy notebook.
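If you don't have the notebook at hand, here is a minimal sketch of what this computation can look like, assuming NumPy and its default random number generator; the function and variable names are mine, not necessarily those of the lecture notebook.

```python
import numpy as np

rng = np.random.default_rng()  # the source of uniform random numbers, where "the magic happens"

def estimate_pi(S):
    # draw 2 * S uniform random numbers in [0, 1): an x and a y coordinate per experiment
    x, y = rng.uniform(size=(2, S))
    # squared distance from the origin; no square root needed, since r < 1 iff r**2 < 1
    inside = x**2 + y**2 < 1.0
    # Monte Carlo estimate: the empirical mean of the indicator estimates pi / 4
    return 4.0 * inside.mean()

print(estimate_pi(100))        # rough: scatters around 3.14
print(estimate_pi(1_000_000))  # typically correct in the first few digits
```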
So I'm introducing a function that produces random numbers, and maybe you notice that this is actually where all the magic happens. There is this magic thing that produces random numbers; we'll talk about that later in the lecture, but let's say we have this thing that produces uniform random numbers between 0 and 1. Now we're going to produce a certain number of them, let's say 100 experiments. So here's what I do: I actually produce 200 random numbers, because for each of the 100 experiments I produce two uniform random numbers, which gives me an x and a y coordinate in the unit square. Then for each pair of x and y coordinates, I square them and take their sum, so I now have x² + y². That's the squared radius of the circle on which this point lies. We want to know whether that radius is larger or smaller than 1, whether the point is inside or outside the unit circle. We could take the square root, but it doesn't matter, because we just want to check whether it's less than 1, and the square root of 1 is just 1; so let's not take the square root. Having taken the sum of the squares, I check whether it is less than 1, and then take the mean of all of this. That's the actual Monte Carlo computation; this is the point where we compute the quantity. Here f is our function that says: are you inside or outside of the circle? We sum it all up and normalize. You could think of this as computing what's known as an empirical mean, but that's just what it is: we take the sum, divide by the number of samples, and now this is an estimate for pi over 4. We want pi, not pi over 4, so I multiply by 4, and now we get something that should look like pi.

So let me run this code. If I run it once we get 3.12; if I run it again we get 3.14, now 2.96. Why do I sometimes get the same number twice, by the way? Think about it for a second. It's because there are only a hundred samples, so it's actually not that unlikely that every now and then we get exactly the same fraction. If I keep doing this, you'll notice that we get numbers that are sometimes above and sometimes below pi; they're actually quite far off, but roughly in the right ballpark. So I can increase that number. Let's say we draw not a hundred samples but a thousand; the code still runs almost instantly, and the numbers get ever so slightly closer to pi. Now let's increase the order of magnitude of samples once again. Now we get numbers that are not that bad: at least correct in the first two significant digits, 3.1. This one is even correct in the first three, 3.14. But of course that's not true for all of them; here's one that's wrong in the third significant digit. Now let's go to a million. This already takes a little bit of time to compute on this machine, but it looks pretty good: those are really good first few digits. Let's try again. At least the first three digits are now right with a million samples. Interesting. So as we increase the number of samples, this computation somehow gets more precise; let's think in a moment about how much more precise it gets. This gives an estimate for the quantity we care about. Now of course we know what pi is.
But under more complicated distributions, like the distribution over all solitaire deals you might want to try and solve, or the design of a nuclear weapon, it's not so straightforward to know what the right answer is. And this kind of computation allows you, in a forward way, just by drawing random numbers, to get evaluations of complicated integrals without knowing anything about integration. You just replace an integral with a sum. That can be very useful, because integration is a hard task.

Okay, before we come to our first gray slide, let's think about why this method works and why it's actually useful. Here is the basic argument for why Monte Carlo methods are useful tools. It's a classic old argument and it's pretty straightforward, so we can do it on one slide, actually on two slides. It works as follows. Here is the thing we want to compute, our expected value: phi = E_p[f(x)] = ∫ f(x) p(x) dx. This is the quantity we care about. Notice that this is a deterministic quantity: there's no randomness here. It's just an expected value of some function under some probability density function, or the corresponding probability distribution and measure. You might think it has something to do with randomness, but it doesn't; it's just an integral of one function against a measure. What we're going to compute instead of this integral, which we don't know how to solve, is the quantity phi-hat = (1/S) Σ_{s=1}^S f(x_s), which is a random number. It's a number computed by drawing random numbers from p, independent and identically distributed. What that means is that we draw a sequence x_s, for s from 1 to capital S, all from p. So, first, they are independent of each other: for any two of them, x_s and x_t, their joint distribution is equal to the product of their individual distributions; and second, each individual distribution is given by p. That's what i.i.d. means, just to remind you. For every single random number we draw, we evaluate the function we care about; that function might be f(x) = x, or f(x) = x², or x^p, or log x, or anything else. Then we sum up all these values and divide by the number of samples. This sum is a random number, because it depends on the random samples, and every time we run it, just like I just ran our code over and over again, we get a different value.

Why is this a good random number? One interesting thing we can show is that the expectation of this random number is itself equal to the number we care about. Here's what I mean by that. Let's think about the expected value of phi-hat, by which I mean the expected value under the distribution implied by this generative process over the individual samples. What is that? It's an integral: the expected value of the function we compute, copied in from above, against the entire joint distribution over all the samples. But we know that this joint distribution factorizes into individual terms, from x_1 to x_S, so we can write these individual p terms inside the sum directly; I've basically already done this first step for you. Now notice that integrals and sums commute, at least in this simple case, because of the simple regularity structure. And now we see: what sits inside the sum, at the end, is actually an expected value.
It's an expected value of f under p. And whether we call the variable x_s or x_1 or x_2 or x_3 doesn't matter, because they are all distributed in the same way; they all have the same p. So even under our abuse of notation we're allowed to do that, because all of these p's are the same. So each term is just phi. What we have is a sum of S identical copies of phi, divided by S: S times phi over S, which is just phi. So the expected value of phi-hat, our random number, is actually the number we want to compute. This is clearly something you might want: it means that this random number will, on average, be exactly the number we care about. That's great. This property is called unbiasedness, and the word means exactly this: the expected value of this random number, which is an estimate of the quantity we care about, is equal to that quantity. Now, we'll have to talk about the word unbiased again, because it's unfortunately a very loaded term. Bias is something emotional, right? And being unbiased sounds really good. But here it's just a mathematical word, so try not to be emotionally affected by it. It really just means this: the expected value is equal to the correct value. So that's good, because it means the number we are computing might not be all that bad; on average it is the right number.

Now, of course, any individual instantiation of our Monte Carlo estimator is not going to be equal to phi, and you just saw that in the code. Let me run that again so you understand what I mean. If I keep doing this, we just keep getting more and more numbers, and, let me reduce the sample size a little again, not a single one of them will be the correct value of pi. But on average they're going to be right. So how far away from the truth are they going to be? For that we need to compute another expected value, and one that is quite interesting is the variance: the expected squared distance from the value we care about. So the variance, let's compute it, is the expected squared distance of our random number from the value we want it to have, which happens to be the expected value of the random number: var(phi-hat) = E[(phi-hat − phi)²]. Let's compute that. We just plug in what this quantity actually is. For the expected value of phi-hat I plug in what we know it to be, phi, from the previous slide; and for the estimator I plug in its definition. Now you'll notice that I've already dragged phi inside the sum: it was previously outside, and I multiplied it by S and divided by S, so that phi = (1/S) Σ_s phi, which makes the next step easier. Now let's take the square of this, carefully: we take the product of the term inside the brackets with itself. That gives a 1/S², and a product of two sums, so to be careful I have to give a new name to the summation variable: one sum is over little s, the other sum is over little r.
Then I directly multiply out the individual terms inside these sums and expand, otherwise the slide is too small for all of this. So there are two bracketed terms, one with s, one with r, multiplied with each other, and expanding gives four pieces: one with f(x_s) times f(x_r), one with f(x_s) times phi, one with phi times f(x_r), and one with phi squared; the minuses are on the cross terms and the plus is on the last one. Now I take the expected value, and I can drag the expectation inside the sums. We just have to be a little careful about what each expectation is actually over. For the middle terms it's easy: those are expectations over one individual random variable, f(x_s) or f(x_r). And the first is an expectation over a product involving two random variables, x_s and x_r, and the function taken of them. So what are those?

There are several ways to proceed. An easy one is to look at these sums and distinguish two cases. Keep the outer sum unchanged and look at the inner sum. Within the sum over r from 1 to S, there are S − 1 terms where s and r are not the same. If s and r are different, then E[f(x_s) f(x_r)] is an expected value of the product of two independent random variables, so their joint probability factorizes into two terms, which means we can take the expected values separately, and we know what those are: they're each phi. So the product term gives phi²; then minus phi times an individual expected value, which we know from the previous slide to be phi, so that's minus phi², twice; plus phi² from the last term. None of this is a random number at all, it's all deterministic, and phi² − 2 phi² + phi² is just zero. So this entire term drops out; we don't have to care about the big double sum anymore. Inside the inner sum, only one term survives: the one where r and s are equal. In that case we have an expected value of f(x_s) squared, E[f(x_s)²], f of x_s and then square the whole thing. We don't know what that is, so we just write it down as a quantity we'll need. Then minus phi², minus phi², plus phi² leaves a single minus phi². So each surviving term is E_p[f²] − phi².

Okay, let's plug that in. Notice there's no s left in this expression, so all the terms of the outer sum are the same: we get S copies of this quantity, and we divide by S². One S survives, and we're left with (1/S) (E_p[f²] − E_p[f]²). And that, by the identity we derived at the start of the lecture, is exactly the variance of f under p, divided by S. Interesting. The variance of f under p is just a number, hopefully a finite number: a deterministic quantity, a real number. And if we divide it by S, then as S increases, as we get more and more samples, this quantity, which is of course an estimate for the error, drops.
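To collect the two results of this argument in one place, here they are in symbols (my transcription of the board work):

```latex
\phi = \mathbb{E}_p[f(x)] = \int f(x)\,p(x)\,\mathrm{d}x,
\qquad
\hat{\phi} = \frac{1}{S}\sum_{s=1}^{S} f(x_s),
\qquad x_s \overset{\text{i.i.d.}}{\sim} p,

\mathbb{E}\bigl[\hat{\phi}\bigr]
  = \frac{1}{S}\sum_{s=1}^{S} \mathbb{E}_p\bigl[f(x_s)\bigr]
  = \phi
  \qquad \text{(unbiasedness)},

\operatorname{var}\bigl(\hat{\phi}\bigr)
  = \mathbb{E}\bigl[(\hat{\phi}-\phi)^2\bigr]
  = \frac{1}{S}\Bigl(\mathbb{E}_p[f^2] - \mathbb{E}_p[f]^2\Bigr)
  = \frac{\operatorname{var}_p(f)}{S}.
```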
The expected squared distance of our estimator from the value it takes on average drops like one over S. So the square root of the squared distance, which is like an average absolute distance, drops like one over the square root of the number of samples. Is that good or bad? To really understand whether it's good or bad, we would need to know a lot more about other kinds of methods we might compare with, and I'm not going to do that. Instead I'm just going to tell you: it's equally bad and good. There are two ways of thinking about this. One is that there are results in stochastics showing that there is no better rate for such estimators: random numbers that provide unbiased estimates of such a quantity, so that fulfill the property we derived on the previous slide, actually have to converge at this rate. It's the so-called minimax optimal rate. On the other hand, there are numerical algorithms for integration which, at least in some settings, converge much, much faster than this algorithm. They're just not based on random numbers and therefore they're not unbiased; there isn't even a notion of unbiasedness there, because no random numbers are involved. But they do converge much faster. So whether this convergence rate is good or not depends on what you're trying to do and what alternatives you have at hand.

Here's a picture of what I mean. It's based on the exact same experiment I just showed you in code; this is really that piece of code (whoops, where's my mouse?), run over and over again, with the estimates recorded. Basically this is a quantified way to look at the experiment I just showed you. On the horizontal axis the number of samples increases, so capital S increases, and I have drawn four different instantiations of the experiment: here's one, there's one, here's a third, and here's a fourth. Imagine we start with S equal to 10⁰, just a single sample; then you basically always get either 0 or 4, because the sample is either inside or outside the quarter circle, so the indicator is 1 or 0, and multiplied by 4 you get 4 or 0. As you get more and more samples, you can get lots of different values, and you see that as the number of samples increases, we somehow get closer and closer to the correct value, which is pi, obviously sitting here at 3.141 and so on. What I've drawn around this is a funnel, the convergence region. I got that by computing the variance of this quantity, which for this simple experiment you can of course do on a piece of paper, because we know what pi is, dividing it by S, and taking the square root; and you can see this square-root shape converging towards 0. Now notice that not all of the red lines stay inside this funnel; that's because it's just an expected squared distance.
The estimate is still a random number, and it may well be outside every now and then, but on average it's going to be inside, and in fact on average this is exactly the distance we expect to see, and you can see that this is roughly true. Here's the same plot again, but now on a log-log scale. To make that work I had to take the absolute value of the distances, otherwise I would get negative values; so the red lines are now absolute distances of the random numbers, the Monte Carlo estimates, from the correct value, pi, and the black line shows the square-root convergence. Why is that a straight line? You might want to think about that for yourself: it's a log-log plot, and it's converging like, well, think about it.

Okay, so this gives a feeling for whether Monte Carlo methods are good or bad. First of all, in a sense they are bad, because to get high precision from this kind of computation you need an extremely large computational effort. This black line has a slope of minus one half, not just roughly but exactly, because the convergence rate is one over S to the one half. That means that to move down by one order of magnitude of error, for example from 10⁻¹ to 10⁻², we have to move two orders of magnitude in the number of samples. So to get the error down to something like 10⁻⁷, which is what you might want on a computer, that's single precision, you need to draw something like 10¹⁴ samples, which is an insanely large number. You might want to try that for yourself: think about how many samples your computer can actually draw in a second, and how many seconds it would take. So this is not the right tool to do high-precision arithmetic on your computer.
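The convergence plot is easy to reproduce. Here is a minimal sketch, assuming NumPy and matplotlib; the closed-form variance pi(4 − pi) used for the funnel is my own small calculation from the identity above, since for f = 4 · [inside] we have E_p[f] = pi and E_p[f²] = 16 · (pi/4) = 4 pi.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
S_max = 10**6

# one long run of the pi experiment; cumulative means give estimates for S = 1 ... S_max
x, y = rng.uniform(size=(2, S_max))
inside = x**2 + y**2 < 1.0
S = np.arange(1, S_max + 1)
estimates = 4.0 * np.cumsum(inside) / S

# absolute error of the estimates, and the predicted RMS error sqrt(var_p(f) / S)
plt.loglog(S, np.abs(estimates - np.pi), label="|estimate - pi|")
plt.loglog(S, np.sqrt(np.pi * (4.0 - np.pi) / S), "k", label="sqrt(pi (4 - pi) / S)")
plt.xlabel("number of samples S")
plt.ylabel("absolute error")
plt.legend()
plt.show()  # on log-log axes, the rate S**(-1/2) is a straight line of slope -1/2
```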
However, at the other end of this domain, there is interesting behavior: we're already at a relative error below one, more or less in the right ballpark, after something like ten samples; we're computing a number of order one, it's about three, and the error is less than one. Drawing ten samples is of course a very simple thing to do. So if you are out to compute ballpark estimates for a number you really don't know, then Monte Carlo methods are a wonderful tool, because they are very easily implemented: you just have to draw random numbers and evaluate the function, and evaluating the function is trivial; if you can't evaluate your function, you can't do anything anyway. So the use for Monte Carlo methods is, in a way, as the first thing you try before you spend time implementing something complicated, or maybe also as the last thing you try if you really don't know what else to do to solve your problem. That makes them particularly interesting: they will typically sit at the top of your toolbox, something you try out first; then you see what things look like, you put it back, and you think about how to actually solve your problem correctly. A typical setting for probabilistic inference on a new problem is to first try out a sampling algorithm, see if it works, get a rough feeling for what you're trying to do and for whether your algorithm works, and then, when you actually need a really precise estimate, think about your computational step a little bit more.

With that, we're finally at a gray slide. Today we've been thinking about computational aspects of probabilistic machine learning. We noticed that integrals are a key component of the computational side of this framework: to compute almost any interesting quantity about a probability distribution, you have to solve an integral. And one straightforward way, well, maybe somewhat straightforward, and we'll see in a moment how straightforward it actually is, is to draw random numbers that are distributed like p, and then compute an integral against more or less any function you want by evaluating the function at those samples, summing up the values, and dividing by the number of samples. This gives the Monte Carlo estimate. This estimate is unbiased, and its error, the square root of the expected squared distance, drops like one over the square root of the number of samples. That means it's quickly very roughly right, but takes very long to be really right. Now, the one challenge I've swept completely under the rug so far is how to actually produce those samples; after this gray slide, when you've taken a quick break, we have to think about how to really do this.

Okay. Now that we've realized that we can use random numbers to compute integrals, the natural next question is how to actually draw these samples. The very first question would be how to draw uniform samples on the unit interval, but we'll come to that in a moment; for now, let's assume we already have access to uniformly distributed random variables on the unit interval. How do we use those to construct i.i.d. random draws from other probability distributions? A first thing to do is to use the definition of a random variable itself. Hmm, my machine first has to wake up... here we go. So, as a reminder from lecture three: we can map from one random variable to another by a change of measure.
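In symbols, the rule from lecture three reads roughly as follows, where u(x) denotes the map from the derived variable x back to the base variable u:

```latex
p_x(x) \;=\; p_u\bigl(u(x)\bigr)\,\left|\frac{\mathrm{d}u(x)}{\mathrm{d}x}\right|
       \;=\; p_u\bigl(u(x)\bigr)\,\left|\frac{\mathrm{d}x(u)}{\mathrm{d}u}\right|^{-1}.
```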
For one-dimensional functions, this can be done by the transformation rule, which is a bit unwieldy to think about, because it means you have to think of the function that maps from the derived variable back to the input variable and take its derivative, which is equal to the inverse of the derivative of the map from the original variable to the derived variable. This gives a precise process to construct specific distributions, if we're able to carry it out; notice that we can do so if we know the probability density function, and that we only need derivatives rather than integrals, which is convenient.

Let's do a simple example. One distribution we might care about is this one: a distribution for which the density of a sample is proportional to the negative exponential of the sample value, p(x) = (1/lambda) exp(−x/lambda) for x ≥ 0. This is called the exponential distribution. It shows up, for example, as a fundamental process: the waiting time in a Poisson process. Imagine you live in a city with buses that just go around on their routes over and over again, never stopping, but they're in traffic, so they're typically some time lambda apart from each other, while the actual waiting time between them is random; then the distribution of your waiting time will be something like this. Having access to uniform random variables, how would you get to this kind of distribution? Here the situation is actually quite easy, and I've already given you the answer down here: we take our uniform variable, call it u, take its logarithm, and multiply by minus the rate, x = −lambda log u. This variable is then distributed according to this density. Why? Let's clean up the whiteboard and see if the math works out. The density of our uniform variable is just p_u(u) = 1 for u between 0 and 1. By the way, there's a bit of a notational issue here: I'm calling the random variable u now, while on the previous slide, which I took from a previous lecture, u was the transformation; you'll have to do a variable-name replacement in your head, I'm sure you can do that. The inverse of the function x(u) = −lambda log u is u(x) = exp(−x/lambda), of course. Following the derivation from the slide before, the density of x is p_x(x) = p_u(u(x)) · |du(x)/dx|, and the first factor is just one, because of the uniform density. And what is du/dx? It's up here, it's easy: du/dx = −(1/lambda) exp(−x/lambda), where the minus one over lambda is the inner derivative, chain rule, times the exponential of minus x over lambda. Then we have to take the absolute value, so we might as well get rid of the minus, and we're left with (1/lambda) exp(−x/lambda), which is exactly the PDF we're looking for. So if you want to draw from this exponential distribution, take your source of uniform random variables, take the natural logarithm of each sample, multiply by lambda, and flip the sign, and you get this distribution.
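As a code sketch, again assuming NumPy and with my own function name, the recipe is essentially one line:

```python
import numpy as np

rng = np.random.default_rng()

def exponential_samples(lam, S):
    """Draw S samples with density p(x) = (1/lam) * exp(-x / lam) for x >= 0,
    by transforming uniform draws as derived above: x = -lam * log(u)."""
    u = rng.uniform(size=S)  # u ~ Uniform(0, 1), i.i.d.
    return -lam * np.log(u)

samples = exponential_samples(lam=2.0, S=100_000)
print(samples.mean())  # the mean of this distribution is lam, so this should be near 2.0
```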
What I've done for this plot is essentially show this process, with a minor twist that is maybe confusing at first but is actually an interesting observation. In the plot, this axis is u and this axis is x; the plot is missing labels, that's stupid, sorry. Here we're drawing u uniformly from 0 to 1: the black dots are the draws, the red dots are at regular intervals to help you think about the density, and each of these points, along each of these gray lines, is transformed into an x by mapping it through the transformation, which is the red line; then I'm just plotting what the x values are. The black curve down here is the density we're looking for, the PDF, and what I'm doing with the black dots is assigning them their x value and then plotting them at uniformly spread heights between 0 and the height of the black curve at that point. You can see that the black dots are fairly evenly distributed under the black curve, which is exactly what we want: if the black dots are distributed uniformly across the region bounded above by the black line, then these draws really are from the distribution defined by this PDF. Now, if you stare at this slide for a while, you might notice that the red line is not actually −lambda log u (here I've set lambda to 1), and maybe you'll come to the conclusion that what the red line actually shows is −lambda log(1 − u). I'll let you figure out why that works as well.

Another example: let's say we wanted to draw from the beta distribution. Actually, drawing from the full beta distribution is a little bit hard, so let's first think about a simpler case. The beta distribution was the prior we used in our experiment for inferring a probability, the proportion of people with glasses. Let me write it down as a reminder: the beta distribution is B(x; alpha, beta) = (1 / B(alpha, beta)) · x^(alpha−1) (1−x)^(beta−1); in the previous experiment I called the variable pi, but let's not overuse pi. The minus ones are just part of the definition, and the normalization constant B(alpha, beta) is the beta function, B(alpha, beta) = ∫₀¹ x^(alpha−1) (1−x)^(beta−1) dx. This is a bit tricky to work with, so let's say for simplicity we're only interested in the case beta = 1. Then the (1−x) part is raised to the power zero, so we can drop the whole thing; let's say that's the distribution we care about. What is that distribution for a general alpha? The normalization constant is easy: it's a simple integral over a polynomial, ∫₀¹ x^(alpha−1) dx = [x^alpha / alpha] evaluated from 0 to 1; at 0 this is 0 and at 1 we get 1/alpha. So B(alpha, 1) is just 1/alpha; how convenient. So the distribution we care about is p(x) = alpha · x^(alpha−1). The form of this already suggests what kind of transformation you might want to use; you just have to think a little about which way around the polynomials work, and you end up realizing that x is given by x = u^(1/alpha), where u is our uniform random variable. Why? Because that means u(x) = x^alpha, and then the rest is straightforward: we again need p_x(x) = p_u(u(x)) · |du(x)/dx|; the first factor is just 1, so we don't have to worry about it, and the derivative is du/dx = alpha · x^(alpha−1), chain rule, and that's exactly what we want.
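The same one-line sketch works for this case too (again assuming NumPy; the function name is mine):

```python
import numpy as np

rng = np.random.default_rng()

def beta_alpha_one_samples(alpha, S):
    """Draw from Beta(alpha, 1), density p(x) = alpha * x**(alpha - 1) on [0, 1],
    via the inverse transform x = u**(1 / alpha) derived above."""
    u = rng.uniform(size=S)
    return u ** (1.0 / alpha)

samples = beta_alpha_one_samples(alpha=3.0, S=100_000)
print(samples.mean())  # Beta(a, b) has mean a / (a + b); here 3 / 4 = 0.75
```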
Okay, so that's relatively easy for this beta distribution. But you can already see that if you wanted to do the same for a general beta distribution, where there's not a 1 here but something more complicated, things quickly get much, much harder. As a homework for next week, one of your math questions this week actually, you'll do this derivation; I'll tell you how it works, and you just have to show that it's true. The way we draw from a full beta distribution, with two general parameters, is to consider two random variables, let's call them x and y, as our inputs, and they actually have to be gamma distributed rather than uniformly distributed. How to draw from a gamma distribution is another little trick that is feasible starting from a uniform distribution, with a little bit of trickery.

With this, we've now realized that we can use random samples from a probability distribution to compute integrals against this distribution, if you want to put it this way, using the Monte Carlo method. Once you have these random numbers, the remaining integral is actually easy: easy in the sense that computing the Monte Carlo estimate is straightforward, and that estimate quickly, as a function of the number of samples, becomes a decent estimate, though it takes long to become a very good one. Then we wonder: okay, that's fine, but maybe the tricky part here isn't actually summing up those samples; the really tricky part is how to produce the random numbers. We've now seen that for basic cases there is at least a principled way to do that, using the concept of a random variable: taking a base distribution and transforming it into the space, onto the measure, that we would like. However, we've also realized that this is quite hard to do for anything other than very basic distributions. So the question we now have to pose ourselves is: what kinds of algorithms can we construct that do this more efficiently and more generally, algorithms that we can apply to more or less arbitrary distributions? We'll spend the rest of this lecture introducing a few basic algorithms of this form.

Before I get to that, I want to mention on the side that one question I've totally brushed aside so far is what a random number actually is to begin with. Normally I would like to have this discussion at this point of the lecture; however, it's really the kind of conversation that is extremely difficult over a recorded session, so I'm going to keep it for the flipped classroom. There we will talk about what a random number is, and, sneak preview, I'm going to argue that random numbers don't exist. If you want to hear more about that, you have to tune into the flipped classroom, where we'll talk about exactly what kind of randomness we need for these algorithms, and why it's very difficult to formulate precisely, philosophically, what a random number actually is. For today, or at least for this lecture, let's continue with something that's a bit easier for me to present in this frontal way, and think about what we need to do to produce samples from more general distributions.
For that, maybe we should first think about why this is actually hard to do. As I already said, in other formulations of machine learning, statistical formulations in particular, where you only care about the minimum or maximum of a function, you can do local computations: you just run down the hill until you reach a point where the function attains its minimum, where the gradient is zero. Sampling is harder, because we want to produce numbers whose density is given by the probability density function, and the density is really a global object. The density could, for example, have a mode, a point where it is maximized, that is very far away from the mass of the distribution; most of the distribution could be spread out nearly flat, and it can also have little isolated islands that are far away from each other. So an algorithm that produces samples from a distribution really needs a form of global description of the entire probability density function, and that is what makes sampling hard.

Another challenge is that the function we're interested in generally isn't even fully accessible. You'll notice that the motivating example I gave for why we need to solve these integrals in the first place is the computation of posterior distributions. In a posterior distribution, the object we can evaluate easily is the product of prior and likelihood; by easily I mean that these are functions that are available once you've defined the generative process. The hard part is the normalization constant. It's often confusing to people why this is even hard, because it's just one number, and it really just scales the entire distribution; the complicated bit seems to be the part up here, which gives all the structure. So why is it so important to get this right? Because without this constant, what we have is not a probability distribution.
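In symbols, the situation is the following: we can evaluate the numerator, but the normalizer requires exactly the kind of integral we're trying to avoid. For the posterior example:

```latex
p(\theta \mid D) \;=\;
\frac{\overbrace{p(D \mid \theta)\,p(\theta)}^{\text{easy to evaluate}}}
     {\underbrace{p(D)}_{Z \,=\, \int p(D \mid \theta)\,p(\theta)\,\mathrm{d}\theta}}
\qquad\text{i.e., in general}\qquad
p(x) \;=\; \frac{\tilde{p}(x)}{Z}, \quad Z = \int \tilde{p}(x)\,\mathrm{d}x.
```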
But of course, it would be nice if we had algorithms that just get rid of this problem of having to know this constant. What we would like are algorithms that are able to sample from a distribution even if we don't know its normalization constant, in particular if we don't know this number: an algorithm that can draw from the distribution p, assuming it's a probability distribution, even if the only thing we can actually evaluate is a function p-tilde which is proportional to p up to some number Z, where Z is arbitrary and unknown. There are algorithms out there that do that, and we will deal with them primarily in the next lecture, number five. But today, at the end of this lecture, I'd like to introduce two basic algorithms which used to be used in practice, and actually still are in very special corner cases, but which today have mostly didactic value: they help us understand why it's hard to build computational methods that draw from a general probability distribution.

The first one is actually an ancient algorithm, well, ancient in the sense of the 18th century; it goes back to the French mathematician Georges-Louis Leclerc, who later in his life became the Comte de Buffon. It's called rejection sampling. Many of you have heard about rejection sampling before; if you haven't, it doesn't matter, here's how it works, and if you have, we will have something interesting to talk about anyway. It's exactly an algorithm for the situation in which we're trying to sample from a distribution p-tilde, the black line in this picture, which is only given up to a normalizer. The one thing that is required is that there is another distribution q, the red curve in this picture, which we call the proposal distribution, and we assume that we know, for a fact, a number c, including its value, such that if we scale the proposal density by this constant c, it is an upper bound on the function we would like to draw from: c · q(x) ≥ p-tilde(x) everywhere. Here the black line, the distribution we want to sample from, is p, and as the red proposal distribution I've chosen a Gaussian. Now, I haven't told you yet what Gaussian distributions are, but let's be honest, if you're taking a probabilistic machine learning class and you've never heard of a Gaussian distribution, then it's going to be tough for you anyway. We will get to the Gaussian distribution properly in one week, so if you find this confusing, just wait a week and you'll hear more about it. But just as a reminder, let me write down its basic form: the Gaussian distribution with mean mu and variance sigma² is given by a normalization constant times the exponential of a quadratic form. In 1D, as in this picture, that's N(x; mu, sigma²) = 1/√(2 pi sigma²) · exp(−(x − mu)² / (2 sigma²)), where sigma² is the variance of the distribution; sigma gives the width of this bell curve, and mu is its center. Here mu is 2, and sigma in this case is actually also 2. And that's it: this is a distribution from which we can draw very efficiently.
There are efficient algorithms for it; I'll talk about those in a week or so. Now, here is how the algorithm works. We draw from the red proposal distribution, which we know how to do. For each such draw, say we've drawn the number 3, or something like it, we then draw a uniform random number between 0 and 1 and scale it by the height of the red curve at that point times the upper-bound constant c. This gives us a dot: we think of placing the dot at a uniformly random height between 0 and the scaled red curve (we don't actually make a graph, of course). Then we check whether that dot is above or below the black curve; for that we have to evaluate the black curve once, and we assumed we can do that. If the dot is above the black line, that is, larger than p̃, we reject it: we throw it away and never talk about it again. If it is below the black line, we accept it and add it to our stash of samples; by that I mean we add the value x from the input domain, not the height of the dot. As we do this, we collect a number of samples, and those samples are actually i.i.d. draws from the black curve. Why is this the case? Let me show you a pictorial argument rather than a formal proof, which is: just look at this graph. The dots below the red curve were produced by drawing from the red proposal and, for each sample, drawing that uniform number between 0 and the scaled height of the curve and placing the dot there. This process produces points that are uniformly distributed in the region outlined by the scaled proposal density; these are clearly derived from i.i.d. draws from the red curve. Rejection then just cuts out the subset of dots that lie in the area outlined by the black curve, so those dots are uniformly distributed under the black curve, and their x-locations are therefore i.i.d. draws from the black curve. Fine, great. So this is a method that works, and in fact it is still used in practice in very low-level libraries to produce samples from basic, simple distributions for which it is possible to write down very efficient proposal distributions. For example, the low-level library algorithms that draw from gamma distributions are typically built this way, and even the internal algorithms that draw from normal distributions typically contain a step that is essentially rejection sampling. So this algorithm is around, it is not wrong, and it is also very old. But it has a flaw, which gives us a feeling for why it is hard to draw from probability distributions, and that flaw is already somewhat apparent in this picture: if there is a spike somewhere in your distribution, and that spike is hard to capture with your upper-bounding distribution, then you are forced into a situation like the one in this picture, where there is a lot of space, here and there and over there, in which we have to throw away samples. In this picture, visually, roughly half of all samples will have to be wasted, right?
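Here is a minimal sketch of the loop just described, assuming we can evaluate an unnormalized target p_tilde and use a Gaussian proposal drawn via numpy; the function and variable names are illustrative, not from any particular library:

```python
import numpy as np

def rejection_sample(p_tilde, mu, sigma, c, n_samples, rng=np.random.default_rng(0)):
    """Draw n_samples from the density proportional to p_tilde, using a
    Gaussian proposal N(mu, sigma^2) scaled by c as the upper envelope.
    Requires c * q(x) >= p_tilde(x) for all x."""
    samples = []
    while len(samples) < n_samples:
        x = rng.normal(mu, sigma)                  # draw from the proposal q
        q_x = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
        u = rng.uniform(0.0, c * q_x)              # uniform height under the scaled envelope
        if u <= p_tilde(x):                        # dot falls below the black curve: accept
            samples.append(x)                      # keep the location x, not the height
    return np.array(samples)

# Illustrative example: an unnormalized bimodal target under a N(2, 2^2) envelope.
p_tilde = lambda x: np.exp(-0.5 * (x - 1) ** 2) + 0.5 * np.exp(-2.0 * (x - 4) ** 2)
xs = rejection_sample(p_tilde, mu=2.0, sigma=2.0, c=20.0, n_samples=1000)
```

Note that c = 20 here is deliberately loose; a loose envelope still gives correct samples, just at the cost of many rejections, which is exactly the flaw discussed next.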
We have to produce them just to find out that we can't use them. That is already a problem in one dimension; in multidimensional settings, the situation gets much worse. Intuitively, you can imagine that if this were a higher-dimensional space, the volume we have to throw away due to rejection grows ever larger and quickly becomes dominant. To get a feeling for this, here is a great intuitive example from the book by David MacKay that I've mentioned before, chapter 29, section 3 of that book, which gives an intuition for why this rejection rate can rise very quickly as the dimensionality of the problem increases. It is a thought experiment, obviously not meant as a practical setting. Imagine you wanted to draw from a Gaussian distribution like the one I just wrote down, using rejection sampling. That is of course a silly idea, but let's imagine you wanted to do it. We have one Gaussian distribution centered at zero, the black line here, which is our p, and we would like to use another zero-centered Gaussian with a different width as our proposal distribution q; that is the red dashed line. Of course we have to give it a larger width, otherwise bounding p is hopeless, so let's say σq, the width of the red dashed distribution, is larger than σp, that of the distribution we want to sample from. Now we need the scaled red curve to be above the black curve, so what is the optimal value of the upper-bound constant c, the smallest one that lifts the red curve just above the black one? You can think about this for a moment, but you will realize, just visually, that the scaled red curve will be above the black curve if we make the two match at zero. If you look at the expression here on the board and set μ to zero and then x to zero, because that is the center of the distribution, the exponential term is just one, and we only have to compare the normalization constants. The normalization constant, by the way, involves the determinant of the covariance matrix; I have assumed this matrix is diagonal with all elements equal to σ², essentially σ² times the unit matrix, so the determinant is σ² to the power d (I am calling the dimension d), and the square root we have to take gives σ to the d. Then, to make the scaled normalization constant of the dashed distribution match that of the black distribution at zero, we have to take c equal to (σq/σp) to the power d. And you can already see where the problem comes from: this number increases exponentially with d, so as the dimensionality grows, we have to make c exponentially larger to keep this upper bound on the distribution we want to draw from.
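Compactly, under the diagonal-covariance assumption just described, the density ratio is

```latex
\frac{p(x)}{q(x)}
  \;=\; \left(\frac{\sigma_q}{\sigma_p}\right)^{\!d}
  \exp\!\left(-\,\frac{\|x\|^2}{2}
    \left(\frac{1}{\sigma_p^2}-\frac{1}{\sigma_q^2}\right)\right)
  \;\le\; \left(\frac{\sigma_q}{\sigma_p}\right)^{\!d}
  \qquad\text{since } \sigma_q > \sigma_p,
```

so the smallest valid constant is c = (σq/σp)^d, attained at x = 0.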
Now, the acceptance rate is proportional to the relative volumes of these two distributions: to remind you of the earlier picture, the acceptance probability is the volume under the black curve divided by the volume under the scaled red curve. Those volumes are given by the normalization constants, and for this diagonal case the acceptance rate works out to exactly 1/c, which means the rejection rate rises exponentially in d. And this happens very quickly: even if σq is only 10% wider than σp, as in this picture, then in a 100-dimensional space (and 100 dimensions is not a lot; in a week or two we will talk about thousand- and million-dimensional spaces), the acceptance rate is already about 1 in 10,000, since 1.1 to the power 100 is roughly 14,000. That is an extremely wasteful process, especially if you keep in mind that we are only using this to produce samples, and the sampling estimator itself only improves with the square root of the number of actual samples. The number of effective samples is the number of samples drawn from the proposal distribution times this acceptance rate, so to get even a reasonably accurate estimate, we have to drastically increase our computational resources. Really, rejection sampling only works in extremely low-dimensional settings. So we need a better algorithm to fix this kind of issue. Normally, at this point in a lecture, I would ask whether people can think of a better algorithm, and if I'm lucky, what comes up is the notion of importance sampling. Importance sampling is a variant of rejection sampling, or maybe a minor improvement of it, which ensures that samples aren't actually thrown away: they are kept around and simply weighted by how likely they would have been to be accepted. In many ways, importance sampling is essentially a continuous version of rejection sampling. So instead of throwing away individual samples, we use the following idea; let's go a bit slower through this. Here is the quantity we would like to compute, our estimate again, which we have now seen several times on the slides. Assume we have a proposal distribution q(x); we no longer need it to be larger than p, we just need it to be larger than 0 wherever p is larger than 0, so that it has support. Then we are allowed to extend the expression by multiplying with 1 in the form q/q and pulling it into the integral, producing the ratio p/q. So now we can think of φ as an integral against q of a different function: no longer f(x), but f(x) times p(x) divided by q(x). And we can do Monte Carlo estimation on this problem by drawing from q, our proposal distribution, and for each sample evaluating not f(xs) but f times p over q. The term p/q is then often called a weight, an importance weight.
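A minimal sketch of this estimator, assuming we can evaluate the normalized density p and the proposal density q pointwise (the names here are illustrative):

```python
import numpy as np

def importance_sampling_estimate(f, p, q_pdf, q_sample, n, rng=np.random.default_rng(1)):
    """Estimate E_p[f(x)] by drawing from the proposal q and reweighting.
    p and q_pdf must be *normalized* densities; q_sample draws from q."""
    xs = q_sample(rng, n)          # draw n samples from the proposal
    w = p(xs) / q_pdf(xs)          # importance weights p(x)/q(x)
    return np.mean(f(xs) * w)      # unbiased Monte Carlo estimate of the integral

# Illustrative example: mean of N(0, 1), estimated with a wider N(0, 2^2) proposal.
p     = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q_pdf = lambda x: np.exp(-x**2 / 8) / np.sqrt(8 * np.pi)
est = importance_sampling_estimate(lambda x: x, p, q_pdf,
                                   lambda rng, n: rng.normal(0.0, 2.0, n), 10_000)
```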
Why a weight? Because if q has a very small value at some point, then xs is drawn there only rarely; so when it is drawn, it should be given a high weight, because it actually represents a larger amount of mass in p, a larger density in p, than it does under q. Interestingly, notice that this is just Monte Carlo estimation; there is no additional approximation here at all. That means this estimator is unbiased, and its error drops like one over the square root of the number of samples. So that's great; purely on these theoretical grounds, it is an estimator we might want to use. But before I give you the big "but", there is also a variant of this. What I've written here obviously only works if you actually have access to the true, normalized p. Now let's say you don't know the normalization. There is a trick that allows you to fix this: you just introduce the unknown Z and then estimate Z as you go along. You can stare at this for a while to convince yourself that this is possible; basically, you compute an estimator for the normalization constant alongside the estimator for the integral itself. Doing so is not unbiased anymore, because you are estimating two quantities at the same time. If you try to redo the unbiasedness proof from the previous slides, it will not go through in this case, because you can no longer separate the integrals; you can imagine why, since there is one big integral here that you cannot pull into the sum anymore. Well, fine: it is biased, but it is still a Monte Carlo estimate, so the estimate of the normalization constant eventually becomes right, and over time this becomes a good estimate again. So that is not the big problem. The big problem is what I left out when saying that the error of this estimator drops like one over the square root of the number of samples: I used big-O-style language, "it drops like", and left out the constant in front of that rate. What is that constant? Let me go back in the slides to show you this again. We found before that the variance of the Monte Carlo estimator is given by the variance of the function it is trying to estimate, a constant, divided by S. That variance is now not the variance of f anymore, but of f times w, this other function. And this function can have a much higher variance than the original f under p; in fact, it can even have unbounded variance, and of course, if the constant is unbounded, then this convergence rate doesn't help us at all anymore.
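Before the picture, here is a minimal sketch of the self-normalized variant described a moment ago, assuming only the unnormalized p_tilde can be evaluated (again, illustrative names; the bias vanishes as the number of samples grows):

```python
import numpy as np

def self_normalized_is(f, p_tilde, q_pdf, q_sample, n, rng=np.random.default_rng(2)):
    """Estimate E_p[f(x)] when only p_tilde = Z * p can be evaluated.
    The unknown Z cancels because it appears in numerator and denominator."""
    xs = q_sample(rng, n)
    w = p_tilde(xs) / q_pdf(xs)            # unnormalized importance weights
    return np.sum(w * f(xs)) / np.sum(w)   # mean of w estimates Z, and Z cancels

# Same toy example as before, but with an arbitrary unknown scale on the target.
p_tilde = lambda x: 7.3 * np.exp(-x**2 / 2)     # 7.3 stands in for the unknown Z
q_pdf   = lambda x: np.exp(-x**2 / 8) / np.sqrt(8 * np.pi)
est = self_normalized_is(lambda x: x, p_tilde, q_pdf,
                         lambda rng, n: rng.normal(0.0, 2.0, n), 10_000)
```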
To give you a feeling for this, and this is the final thing we will do today before I stop, let me show you another picture. It is a bit of a confusing picture, so you might have to pause the video at some point and stare at it for a while. It is a somewhat constructed situation: let's say we do importance sampling where we want to draw from this black distribution using this red distribution as a proposal. Notice the strength of importance sampling: the wonderful thing is that the proposal no longer has to be an upper bound on the true distribution; we don't care, it can in principle be anything, as long as it has support wherever the black line has support. Here I am using a Gaussian as the proposal distribution, and Gaussians have full support everywhere, so this condition always holds. Now, what I do is draw these red dots from the proposal distribution; the black dots, by the way, are actual draws from the correct black distribution, which I can do in this case because it is a controlled experiment, for comparison. The golden line shows the importance weights, or more precisely, not just the ratio of the black to the red curve, but that ratio multiplied by f; it is the function we are actually trying to integrate, if you like: f times p over q. And f here is this linear function, this gray line, since let's say we want to estimate the mean of the black curve. You can guess that the values of this golden function, under samples from the red distribution, have a much more spread-out distribution than the values of f under samples from the black curve, and that is indeed what we see. In this bar plot, in black, you see the empirical distribution, just a histogram, of samples drawn from the black curve and passed through f, values whose mean will be our estimate for the mean of the black distribution; in red, you see the corresponding importance-sampling values, f times p over q. And you can see that the red distribution is much, much broader than the black one: it has a much higher variance. In fact, there are even settings, and you can find one in a nice exercise in David MacKay's book (a different setting than the one in this picture), where the variance of these functions is simply unbounded, so you can even get infinite variance for these estimators. In practice, it is very hard to detect when that happens, because the very large values that make the variance unbounded of course occur extremely rarely. So we do not really get rid of the problem of massively reduced convergence rates when we move from rejection sampling to importance sampling; it is just easier to design an importance sampling method than a rejection sampling method. The problem of slowed-down convergence remains, and in fact the exponential slowdown in higher dimensionality is still there; it is just a little more hidden away, because we do get samples without rejection, it is just that these samples can have much, much higher variance. With that, I am at the end of today's lecture. It is a bit of an abrupt stop, because I am leaving you with a bit of a cliffhanger before we move into the actual methods that help us address this computational issue of sampling. I told you today that integration is extremely important in probabilistic machine learning; it is, in a sense, the defining operation of probabilistic machine learning. One way to address this integration problem is to use random numbers, sampled from the probability distribution against which we want to integrate, to compute such integrals using the Monte Carlo method. This is a framework that very quickly produces rough estimates and only very slowly produces precise ones. It is maybe the tool you want to try first, because it is usually easy to implement, but in general it is not the most efficient way to solve integration problems. And we found that in all but the simplest base cases, there is no analytic way to draw these samples.
That, actually, is the challenge in designing Monte Carlo methods: not the integral itself, but producing the samples. At the end of this lecture, we saw some basic methods for producing samples from non-trivial probability distributions. These work in low-dimensional, simple settings, but they typically do not scale to challenging, in particular high-dimensional, problems. So in the next lecture, we will deal with algorithms, developed over the past century, that scale much better to higher-dimensional settings. These will not be perfect either; actually, no algorithm is ever perfect. But they get us much closer to genuinely interesting performance and can be used for real problems. Thank you for your time.