Welcome, everyone actually in the room and everyone at home. Today is the pre-Christmas lecture, so we're a slightly smaller, more comfortable crowd here in the room. That nevertheless means we still have a whole lecture ahead of us about integration. So last week we spoke about how people typically do integration in machine learning, using Markov chain Monte Carlo. And today we'll talk about integration as a computational problem, with less thought about how to do it most efficiently and more about how to do it properly. So I want to use this opportunity. I said last week already that integration can serve as a good test case for how one might want to think about computation. And today I really want to focus on this thought. So we're going to pick one particular integration problem and then think about it carefully, slowly, trying to understand what actually happens when you're computing something, and in which sense you can think about this process of computation as learning. So here's our problem. It's a function. It's called f of x. X is a real number. So it's a real-valued function that is given by the exponential of minus sine squared of three x, minus x squared. So I've been known, sometimes even in job interviews, to ask people what this function looks like. Now you've all got the slides in front of you, so there's not really a point in me asking you this. But it's a good idea to pause for a moment and think about what actually happens when you think about the shape of this function. So here's a string, a set of characters, that is, I think I counted once, give or take 20 characters long. And by writing down that string I've uniquely identified one function. Thanks to your undergraduate education, we all agree on the formal language that describes this function. You know what those characters mean. And that means there is a unique object, you could call it a function, or a unique graph, a unique line in a coordinate system, that this string identifies perfectly. And that means in some sense there is nothing to be uncertain about, right? Once I write down this function, you know everything you could possibly know about it, because it uniquely identifies one function. And that function of course looks like this. One way to think about this is as a product of two functions, exp of minus sine squared of three x, and exp of minus x squared. So exp of minus x squared is the Gaussian bell curve. It's this thing that everyone has in their heads already. And exp of minus sine squared of three x: sine is a number between minus one and one, sine squared is a number between zero and one, so exp of minus a number between zero and one is a number between e to the minus one and one, e to the zero. So we've got our Gaussian bell shape multiplied by some oscillating function that keeps going up and down but doesn't go to zero. It only goes down to one over e and then comes back up to one. So the Gaussian bell is like an envelope around this function. So here's this curve. It's unique. We know everything about it. But that doesn't mean that we can answer every question about this function directly. So if I define another object called capital F that is given by the definite integral from minus three to plus three over this function, then you don't know what that number is, even though I've uniquely identified this little f. 
You don't know because, A, you can't find that number, an expression for that number, in a table of integrals. I've checked. So there are these big books full of integrals, right? And let me close that so that you don't get blinded. Like the Gradshteyn, for example. But if you check in there, you're not going to find this integral. It's not in there. But you also don't know because there is no atomic operation, no function in the GNU C library, that just evaluates this integral, in contrast to the gamma function or something like this. So since you don't know, you clearly are uncertain about it, right? That's what being uncertain means. There is something we don't know, a question we can't answer. But we're not totally uncertain about it. We know, oops, we know something about this problem. Let me fiddle around with my slides a little bit. And here's maybe a point for a question, even if you've already stared at your slides. So what do you know about this integral? Can you say some things about this number F, capital F? Yeah, so it's finite and bounded in the following sense. So first of all, we can notice that this function is, well, maybe it's even easier to say it's lower bounded by zero, right? Because it's the exponential of something. So that function is non-negative. So the integral over this non-negative function has to be at least zero, right? It's larger than zero. And secondly, as you said, this function is upper bounded by one. So if you think of a box from minus three to plus three of height one, that has area six. So this integral, capital F, lies between zero and six. In fact, yeah, exactly, we can also put this upper bound of the Gaussian around it, and then we know that this integral has to be less than the integral over this Gaussian function. So the integral over this Gaussian, the exp of minus x squared, over the indefinite domain, from minus infinity to plus infinity, is the square root of pi. That's something you can find in your textbooks. We're looking for the definite integral, though, from minus three to plus three. So we have to multiply by some correction factor, which happens to be, if you think about it for a bit, the error function of three. It's not entirely straightforward to see that. It's clear that it has something to do with the error function, but why it is exactly erf of three and not one half times something else, but it's, yeah, the error function of three. And that's a function that is in the GNU C library; it's in scipy.special, so you can call it. And if we treat that as an atomic operation, something we can just do, then we'll get 1.77. So now we know that this number, first we realized that it lies between zero and six, and then we realize that it's actually between zero and 1.77. So that's already a pretty narrow range, right? We're not infinitely uncertain. And this is totally typical for computations. I actually can't think of a situation where you don't know anything at all about the result of a computation. You can usually bound it quite well, sometimes vaguely and sometimes very precisely, like in this case. So I'm showing you this plot to motivate why it might not be a bad idea to treat the problem of finding this integral as an inference problem in the Bayesian sense, by putting a prior over the unknown object and the quantities that we can compute, and then performing Bayesian inference. 
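For readers following along, here is a minimal sketch, not from the lecture materials, of the two bounds just described, using scipy.special.erf; the integrand and the numbers follow the description above.

```python
import numpy as np
from scipy.special import erf

# The integrand from the slide: f(x) = exp(-sin^2(3x) - x^2)
def f(x):
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

a, b = -3.0, 3.0

# Crude bound: f is non-negative and at most 1, so 0 <= F <= (b - a) * 1 = 6.
box_bound = (b - a) * 1.0

# Tighter bound: f(x) <= exp(-x^2), whose definite integral over [-3, 3]
# is sqrt(pi) * erf(3), roughly 1.77.
gauss_bound = np.sqrt(np.pi) * erf(3.0)

print(box_bound, gauss_bound)   # 6.0, ~1.7724
```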
One reason why we might be able to do that is that we begin with an actual prior under which we are finitely uncertain. But before we go to this Bayesian interpretation, let's first think about the connection to last week, to Monte Carlo. How would we solve this problem with Monte Carlo? So you're saying rejection sampling. Yes, and you can actually do importance sampling as well, which is a little bit more efficient. So here it is on the left. I'm drawing random numbers from the Gaussian distribution with standard deviation one over the square root of two. So that's this standard exp of minus x squared distribution. And then for every draw from this distribution, I evaluate the ratio between the red curve, the function we care about, and the gray curve, the Gaussian curve. I know what the integral over this Gaussian curve is; it's the thing that was on the previous slide, the square root of pi times the error function of three. So I multiply that number, which is capital G, by the Monte Carlo estimate for the average ratio between f and g. I'm just summing them up. And if I do that, I get an estimate for this unknown number that lies between zero and 1.77, which I'm plotting here on the right as a function of the number of function evaluations. And what I'm plotting here is the absolute error between this Monte Carlo estimate and the actual value of this integral, which I have computed by some magical means, off-screen, kind of, right? So now your question. My question was also a problem. So this plot on the left is mostly just a visualization. So what I'm actually drawing are the x i. Those are the locations on the x-axis. And I made this plot by taking all these x-locations and then, at each point, drawing another uniform random variable between zero and one and multiplying it with the value of the Gaussian at that point. Why do I do that? Well, because now you can see that the density of these black dots under this curve seems uniform, it's sort of visually uniform. There's no structure in this thing, which is a good unit test, a sanity check, to see that those samples seem to make sense, right, that they're not mis-scaled somehow. But actually, I wouldn't, in fact, need to draw precisely from the Gaussian distribution. The only thing I need to do is to draw from a distribution that has support on the domain we're trying to integrate over and for which I know the analytic integral. Because I need to be able to multiply with this G, so I need to know what G is, and then I need to be able to evaluate those ratios. So if my proposal doesn't have support on this domain, then I'll have divisions by zero, basically. And if I don't know what G is, well, then I don't know what I'm computing. So I could in particular also use a uniform distribution. This is not specific to Gaussians, right? I don't need to have this upper bound. So here I have a second line in green which draws uniform random numbers. So numbers between zero and one, then I subtract one half, so now I have numbers between minus one half and plus one half, and I multiply by six, so now I have numbers between minus three and plus three. For each of these I evaluate this ratio between f and g. So g is just one, basically, up to the G, which is equal to six. Fine, and then I just sum those up and I get an estimate. And on the right you can see those two curves converging towards zero. And the thin gray lines are lines of slope one over square root of n. 
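Here is a small sketch of the two Monte Carlo estimators just described. The proposals follow the lecture, a Gaussian with standard deviation one over the square root of two and a uniform on the domain, but the implementation details are my own assumptions, not the code behind the slides.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)

def f(x):
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

a, b, n = -3.0, 3.0, 10_000

# Importance sampling with the (unnormalized) Gaussian proposal g(x) = exp(-x^2),
# i.e. samples from N(0, 1/2).  G is the definite integral of g over [a, b].
G = np.sqrt(np.pi) * erf(3.0)
x = rng.normal(0.0, 1.0 / np.sqrt(2.0), size=n)
inside = (x >= a) & (x <= b)                    # keep only samples in the integration domain
F_gauss = G * np.mean(f(x[inside]) / np.exp(-x[inside] ** 2))

# Same idea with a uniform proposal on [a, b]: g is constant, and G = b - a = 6.
u = rng.uniform(a, b, size=n)
F_unif = (b - a) * np.mean(f(u))

print(F_gauss, F_unif)
```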
This is a log-log plot, so those lines look like straight lines because it's a power law. And the dashed lines are those analytic statements about the convergence rate that we had in the previous lecture. So they are given by the standard deviation of the samples divided by the square root of the number of samples. That's the theoretical statement for the expected error of this estimate. And you can see that that's actually pretty good. You might notice that those lines tend to be in the upper part of this distribution of dots. That's because it's a log plot, right? So this down here is much closer to this number than up here. Okay, so that's the Monte Carlo way of doing things. And last week we realized that this convergence behavior of this random process is in some sense optimal, because it's the best unbiased estimator of this quantity built from random variables. So today our goal is going to be to beat this by quite a bit. So how would we beat that? Now of course, if this were a lecture on classic numerics, we could go and write down all sorts of classic quadrature rules. I'm not going to do that. Instead, let's imagine that on your way here today, the snow is gone, but imagine you had a little bike accident this morning, like you got hit by a bus or something, and you have some kind of retrograde amnesia. And now you've forgotten everything you learned in high school about how to compute integrals, everything. But you're still left, weirdly, with your beautiful knowledge of probabilistic machine learning that you got last term from Professor Macke. If that's the case, then you can treat this problem as a Bayesian inference problem. You could say: here are quantities that I get to see, called f of x; I can evaluate f of x wherever I want. And there is another quantity that I care about, called capital F, that's the integral. And I can use the framework of Bayesian inference to construct an inference algorithm that tries to estimate capital F from evaluations of little f. Little f are the things I can evaluate, they are data, and capital F is the latent quantity, the variable that I would actually like to know. So to do that, let's write a little bit of math on the slides. We start by putting a Gaussian process prior over f. Why a Gaussian process? What else, right? It's the first knee-jerk reaction you could have, because it's the go-to Bayesian regression model for functions. And we're going to choose a prior mean function and a prior covariance function, called the kernel. We know from previous lectures, and also your GP lectures, that Gaussian processes have this beautiful property that any linear operator acting on this function induces another Gaussian distribution. So this Gaussian process is associated with a Gaussian distribution over any affine map of f, which has a mean given by the affine map of the mean, and a covariance given by the affine map applied to the covariance from both sides, sort of. What that exactly means, we'll see in a moment. In particular, this also holds for the linear map that is the integral, right? Integration is a linear operation in the sense that the integral of alpha f plus beta g is equal to alpha times the integral of f plus beta times the integral of g, right? So that's a linear operator. And that means that this Gaussian process over little f is also, at the same time, automatically a Gaussian distribution over the integral. 
If there's just one integral we're trying to compute, that is literally just one Gaussian random variable, right? Like one bell curve. We could also think about the indefinite integral, so from, I don't know, minus infinity to x, or from a to x; then it would be a function of x, which would be a Gaussian process. Okay, so that's the idea. But has this actually gotten us anywhere? Well, it means that if we have a Gaussian process over the function f, then in principle there's also a Gaussian distribution over the integral, but can we actually evaluate it? Well, let's see what we have to do. So let's say we already had some observations of the function f. We've evaluated it at a bunch of points, because that's what we're going to do, right? Then we're going to get a Gaussian process posterior. And we've now seen several times, in the lectures here but also in previous lectures, that this Gaussian process posterior distribution has a posterior mean and a posterior covariance which look like this. So these are functions that involve a bunch of linear algebra. They involve evaluations of the function value at various locations, inverses, or solutions of linear systems of equations that involve kernel Gram matrices. And then there's the function mu of x evaluated at some test points, and the kernel evaluated at a test point and all of the training data. So if we choose this form for this Gaussian process and then think about this object up here again, then that means, to get the distribution over the integral, we need to compute the integral over this object. So this is a function of little x that we're trying to integrate over x. And notice that, because of this linearity of integrals, we can integrate the mean, plus, and then we have to integrate only this function here, this kernel function of little x. And on the right, for the covariance, we have a bivariate integral from both sides over this covariance function, which is first a bivariate integral over the kernel itself, plus an inner product between these functions that already showed up in the posterior mean here. So through this kind of derivation, if we have evaluations of the function at various points, then we have embedded the problem of computing an integral in the reproducing kernel Hilbert space, right? In computing integrals over the kernel. So if we know how to compute integrals over the kernel, then we can compute integrals, at least approximately, over any function that we're trying to learn. Because these expressions here do not involve any integrals over f, right? The function values only show up as actual values in y, as numbers, as floating point numbers. I say embedding because one term for this object here is the kernel mean embedding, which you may have heard somewhere else before, maybe in the statistical machine learning class. So that means we now have to be able to compute integrals over the kernel. So we have to choose kernels for which we can do the integral in closed form. Let me pick one kernel. What's your favorite kernel? Matérn, ah, very good. Okay, you've been warned not to use the Gaussian kernel. Then the Matérn kernel would be one possible choice. I'm going to make it even easier and I'm going to choose, now I've lost focus again, I'm going to choose, big reveal, drum roll, the Wiener process kernel. So that's the minimum kernel. 
The kernel is given by, for the covariance between two function values, the minimum of the two input points minus a constant. The constant is necessary to make it positive definite. One simple way to choose the constant is to set it to something smaller than the smallest input value you're going to operate on. We know that we're going to integrate our function from minus three to plus three, so I'm going to set c to minus three. Okay, that's a function of x and x prime. It's a very simple function. As a function of one argument, the minimum function looks like a ReLU, basically, a rectified linear unit. So it's very easy to integrate. You can do that as a homework exercise if you want to. By the way, there's no homework this week, but, you know, here's something you could do. And if you do that, then you'll find that this particular integral, this kernel mean embedding, is given by this function. So this is the function that computes the integral over this kernel from a to b, integrating over the first argument of the kernel and leaving the second one as an unbound variable. So it's still a function of one variable, in this case x in this notation: it's one half times (x squared minus a squared), plus x times (b minus x), minus (b minus a) times c. So b and a are the integration limits and x is the variable that this function depends on. It's now a function of one variable. Well, actually, it's also a function of the integration limits, of course, but we'll assume those are constant. And the bivariate integral you can also do. You basically take this function and integrate it again over x, and you get this thing, which is just a number. If b and a are constants, then it's just a number, right? And for simplicity, I'm going to say that the prior mean function is just zero, because then its integral is just zero. So that's going to be really easy. It works well for this particular function because that function goes nicely back to zero. If it didn't, you'd need to do something slightly different. So that's it, actually. This is something you can implement on a computer. I hope you can convince yourself that these are two functions that you can write down in Python, right? They're very straightforward. They're not complicated. They don't need any complicated operations. Well, they need powers, squares and cubes, but other than that, nothing. So you can certainly implement those. And now we're going to use this to do Bayesian inference. So we start with our function again, in black, in the background on the left. Here is our Wiener process prior. Wiener processes have this interesting property that they are sort of asymmetric, right? They are not stationary. They start at some small point and then the variance keeps growing. On the right, we still see our Markov, no, not Markov chain, just Monte Carlo estimate. And now we're going to condition on a bunch of function values. So I evaluate at three points in this case, at minus three, at zero, and at plus three, and you get a posterior Gaussian process. These red lines are samples from this Gaussian process. The Wiener process is very rough. It's the famous Gaussian process that corresponds to Brownian motion. It's the oldest Gaussian process. It was studied already by Einstein when he wrote his annus mirabilis paper about Brownian motion. That's exactly this Gaussian process. So if you keep evaluating, now we have five evaluations, you see a red line appearing up there. 
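As a rough illustration of the scheme described so far, here is a compact Bayesian quadrature sketch with the Wiener (minimum) kernel, zero prior mean, and the kernel embeddings just derived. It is not the lecture's code; in particular, I place c slightly below the left endpoint so that the Gram matrix stays invertible when a node sits exactly at minus three (the Kalman filter formulation below handles that case through its initialization instead).

```python
import numpy as np

a, b, c = -3.0, 3.0, -3.5       # c must lie strictly below the smallest evaluation node

def f(x):
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

def k(x, xp):
    """Wiener (minimum) kernel: k(x, x') = min(x, x') - c."""
    return np.minimum.outer(x, xp) - c

def k_mean(x):
    """Kernel mean embedding: int_a^b k(t, x) dt = (x^2 - a^2)/2 + x(b - x) - (b - a)c."""
    return 0.5 * (x ** 2 - a ** 2) + x * (b - x) - (b - a) * c

# Double integral of the kernel over [a, b]^2 (a short calculation from k_mean).
kk = b ** 3 / 3 - a ** 2 * b + 2 * a ** 3 / 3 - (b - a) ** 2 * c

X = np.linspace(a, b, 7)        # evaluation nodes (a regular grid here)
y = f(X)
K = k(X, X)
alpha = np.linalg.solve(K, y)

F_mean = k_mean(X) @ alpha                                   # posterior mean of the integral
F_var = kk - k_mean(X) @ np.linalg.solve(K, k_mean(X))       # posterior variance
print(F_mean, np.sqrt(F_var))
```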
The red solid line is the point estimate of the integral, the integral over the posterior mean. Or the posterior mean estimate for the integral, which is the same thing, because the expected value of an integral is the integral of the expected value. So if you keep doing that, then our process sort of slowly begins to converge towards the true function. And there's a big step just now, and it keeps getting better and better as we go on, and converges really fast, at some super high rate that we'll need to discuss in a moment. So first of all, before we go on, let's marvel a little bit at this result. This is evidently much better than the Monte Carlo line, hopefully you'll agree. How much better is it? Let's say you wanted to achieve something like 10 to the minus seven error. Roughly, that should be good enough for everyone; not relative, but total error 10 to the minus seven, which in this case is almost the same, because the integral has value roughly one. Then you would need something like 200, a little bit less than 200, evaluations with this integration method. How many would you need with Monte Carlo? Well, you can visually try to extrapolate this black line until it reaches 10 to the minus seven, and you'll end up with something beyond 10 to the 11. So it would take years and years to sum up to this. And this evaluation is basically instant, you don't even notice, there's no runtime here, okay? So this is clearly much faster. Okay, so first of all, that's a good thing. Now you might wonder, though, well, maybe you've already made all sorts of interesting observations about this plot, and what we're going to do for the rest of the lecture, basically, is talk about this plot, about all its properties, all the things that seem wrong with it and all the things that seem good about it, and think about them. That's going to be the entire rest of the lecture. So a first thought you might have is: okay, fine, but we haven't talked about how expensive this is, right? Monte Carlo is very cheap in a way. You just draw random numbers, you evaluate the function at all the locations given by the drawn random numbers, and then just sum them up. Summing is very easy, it's a very fast thing, it can even be done in parallel. How expensive is this? So first of all, notice that this plot kind of makes sense, because we're plotting against the number of function evaluations, and both of these methods need to evaluate the function n times. Either you do it at random locations, for Monte Carlo, or you evaluate on this regular grid that I just happened to choose. We're going to need to talk about why I chose a regular grid in a moment. So in that sense, these are meaningful comparisons. But of course in the second case we're doing Gaussian process inference, and maybe you still have, despite lectures two and three and four, this kind of vague feeling in your stomach that Gaussian process inference is somehow expensive. Well, this one isn't, because this is one of those processes that you can implement as a Kalman filter. And we spoke about Kalman filters in lectures five and six and seven, in particular five. So let me do the one slide on Kalman filtering. 
The one thing that I'm going to say at the beginning, which you don't need to understand, is, Nathanael already pointed this out and Jonathan Schmidt briefly as well, that there is this limit case, the notation for the limit of Kalman filters towards infinitely small steps, that's called a stochastic differential equation. So a natural way to write this process in that language is the first line. If you haven't heard about this, it doesn't matter, we'll get to the line that you understand in a moment. So it turns out that this particular process, which is called the Wiener process, is connected to this weird object d omega, or dW, depending on notation, that you may have seen somewhere before. It's the Wiener measure. So the process can be written as the solution of this stochastic differential equation, which happens to be zero dx plus one d omega. And that means we can define a state space that consists of two states. The first one, and state spaces should make sense to you, right, you've seen those in the previous lectures, the first state is going to be the function value itself. That's the thing we get to observe. We need to have that, because otherwise we can't condition on observing it. And the second state is going to be the integral, the thing we care about. We need to have that as well, because otherwise we can't make statements about it. So these are the two minimal things we need in our computation procedure. And we'll couple them to each other by this beautiful linear ordinary differential equation given by, basically, this matrix F. You don't need to understand this weird thing called the Wiener process. But if you think of an ordinary differential equation that just looks like dz of x, where z is the state, equals F times z dx, or dt, okay, I've mixed up t and x here a bit, then you can maybe convince yourself that this kind of linear map between the two states encodes integration, right? A simple way to think about this is actually in the next line. If you discretize this process with a finite step size, then that amounts to computing updates to the integral as the old value plus the length of the step times the value of the first state before. Right, so the integral at x plus dt is the integral at x plus dt times the derivative, so little f of x, the integrand, before. That's what this equation says. So we can actually discretize this stochastic differential equation in this form, and now we're back at notation that you've seen before. So these are the variables that define a Kalman filter. They define a prediction model, A and Q. You remember that you can define these linear time-invariant systems with these matrices A and Q, where A is the deterministic step and Q defines the stochastic part of the process. And then we need an observation model. The observation model is defined by the variables H and R. H tells us what we are seeing: we get to see the function value, but not the integral, so H has a one in the component corresponding to the function value and a zero in the component corresponding to the integral. And R defines the noise in the observations, and here, similar actually to simulation methods for differential equations, R is zero, because we just assume we can evaluate the integrand wherever we want, without noise. And we also need to initialize. We do that by just setting the initial function value to what we observe at the beginning of the integration domain. 
The integral we initialize to zero, of course, because at a, at the beginning of the integration, the integral from a to a is zero. So we set that, and after that we know exactly what's going on, so we can set the prior covariance to zero, the prior precision to infinity, basically, and then we just keep running the Kalman filter. So I'm not going to write down again those lines of the Kalman filter that you've seen before, but once you know what A, Q, H and R and m1 and P1 are, they are just an algorithm that you can call, and it runs, and the only thing I have to tell it is how far to step. And then you can just look up what's actually in here. These are the particular expressions that arise from this choice of kernel, and you see that the update equations for the integral look like this. So at each point in time, we evaluate the function, which gives us a value y i, so we just set our estimate of the function value, and then we update our estimate for the integral, which is given by this line. It's the old integral plus half of the step size times the observation here plus the observation at the previous step. So that's an average of the last two evaluations, the current one and the previous one. And then we get to update our covariance. We see the function value, so we're totally certain about the function value, and the uncertainty about the integral gets increased a little bit, by something that has to do with the step length cubed. So now there are various ways to think about this. First of all, the fact that this is a Kalman filter means we can implement this in linear time. So this algorithm is as expensive as Monte Carlo. It's just a sum over a bunch of values multiplied by a constant, but again, the Monte Carlo estimate also needed us to evaluate some other functions and multiply by them. So that's fast. The second thing is we can look at this expression and wonder what kind of estimate this is. So there are various ways of getting to the next insight. The first one is to just look at this Gaussian process and ignore the fact that it's a Kalman filter, and let me just remind you, let me go back. So Gaussian process posterior means, here is our posterior mean. These posterior means are weighted sums over the kernel. If the prior mean is zero, then what we have here is a vector multiplied by a matrix, so it's just a bunch of numbers, a vector, multiplied by kernel functions, right? So this is a weighted sum over kernel functions, and the kernel function is the minimum function, so it's a piecewise linear function, a ReLU, basically. So, as a weighted sum over piecewise linear functions, the posterior mean over our integrand is a piecewise linear function. And if you sum together ReLUs, you get, no, not a ReLU, but a new piecewise linear function, one that sort of just goes up and down. And in fact, it's a sum over as many of these piecewise linear functions as we have evaluations, and it's supposed to interpolate our function, so it goes straight through the evaluations that we have. There's only one line that has that property. It's the linear spline, right? The function that goes through all of the function values and then extrapolates like a constant at the end. And we actually see that here, right? It's just a bunch of straight lines between the evaluations. So here we've evaluated five times, here, here, here, here and here, and we get a straight line as the posterior mean. 
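Here is a sketch of this filter written out explicitly; it is my own code, not the lecture's, and it assumes the state is ordered as (function value, integral) with the scale set to one. The A and Q matrices follow from the state-space model just described, and the resulting updates are exactly the trapezoid average and the step-length-cubed variance increments mentioned above.

```python
import numpy as np

def f(x):
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

def kalman_quadrature(a, b, n, theta2=1.0):
    """Kalman-filter quadrature for the state z = (f, F), with f a Wiener process and dF = f dx.

    Linear-time in the number of evaluations; the update reproduces the trapezoid rule,
    and the variance of the integral grows by theta2 * delta^3 / 12 per step.
    """
    x = np.linspace(a, b, n)
    H = np.array([[1.0, 0.0]])                # we observe the function value, not the integral
    m = np.array([f(x[0]), 0.0])              # initialize: observed f(a); integral from a to a is 0
    P = np.zeros((2, 2))                      # ... and we are certain about both
    for i in range(1, n):
        d = x[i] - x[i - 1]
        A = np.array([[1.0, 0.0], [d, 1.0]])                                  # deterministic step
        Q = theta2 * np.array([[d, d ** 2 / 2], [d ** 2 / 2, d ** 3 / 3]])    # accumulated process noise
        m, P = A @ m, A @ P @ A.T + Q                                         # predict
        S = (H @ P @ H.T).item()                                              # innovation variance (R = 0)
        K = (P @ H.T / S).ravel()                                             # Kalman gain
        m = m + K * (f(x[i]) - m[0])                                          # condition on the evaluation
        P = P - np.outer(K, K) * S
    return m[1], np.sqrt(P[1, 1])             # posterior mean and standard deviation of the integral

print(kalman_quadrature(-3.0, 3.0, 33))
```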
It's a bit difficult to see between all those wiggly samples, but that's it, right? So what's our estimate for the integral? Well, it's the integral over this estimate. So what's the integral over this piecewise linear function? It's actually the thing that we saw in the Kalman filter formulation. It's the sum over the average values in each block, right? So in each of these regions, we have a straight line that goes from the left end to the right end, and the integral is just the area under that trapezoid, right? Ooh, trapezoid, ah, it's the integral over the trapezoids. It's the trapezoid rule. It's the second most simple integration algorithm you've ever seen. The second most, because the most simplistic one is Riemann sums, where you just start on the left and take steps to the end, moving up a step each time. That's actually almost the same, up to some technical complication. This is maybe the more natural choice, and you've used this in high school, right? Maybe for your A-levels you had to compute an integral at some point, and one way to do that is to just draw straight lines and compute the areas under those straight lines. So what we've just done is we've discovered that this very basic integration method is actually Gaussian process inference. So here's a classic algorithm, maybe the most classic integration method, that turns out to be the posterior mean of some complicated Gaussian process inference scheme. And it's directly connected to it, right? It holds for every choice of how many evaluations I make, even if the grid is irregular. And this is one of these high-level insights that I want to get across in this lecture. Even if you don't care about integration, this is a very important and actually quite common result: it turns out that you can think about various classic algorithms as the red line in the middle, the posterior mean, of a Gaussian process inference scheme. This can also be done for basically every other numerical problem. For linear algebra, there's a corresponding relationship that you actually already saw in lecture number three, I think, for various iterative linear solvers, like the Cholesky decomposition, like conjugate gradients and various other ones: they arise as least squares estimates over either the solution or the matrix, and least squares estimates are essentially Gaussian estimates, up to the uncertainty. For integration, we've just seen that the trapezoid rule, and I'll show you more examples later, turns out to be the posterior mean of some Gaussian process scheme. For optimization, we won't see it because there's more interesting stuff to learn after Christmas, but it turns out that various classic smart optimization rules, basically all quasi-Newton methods, so BFGS and all its variants, are least squares estimates, so they arise from some Gaussian estimate over the Hessian of the loss function. For ordinary differential equations, Nathanael pointed out in passing that there is a way of thinking about specific Runge-Kutta methods as very specific Gaussian posterior estimates over the solution of the ODE within the individual step. And for PDEs, Marvin, even though I'm not sure everyone got it, basically pointed out that you can think about a very large class of PDE solvers, generalized Galerkin methods, also as Gaussian process inference. So in all of these cases, the classic methods actually turn out to be Bayesian estimates. They're just Bayesian posterior estimates. 
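As a quick check of this equivalence (my own snippet, not from the lecture): the Wiener-kernel posterior mean from the earlier sketch and the explicit trapezoid sum agree to floating-point precision on the same nodes.

```python
import numpy as np

a, b, c = -3.0, 3.0, -3.5     # c strictly below the smallest node, as before

def f(x):
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

x = np.linspace(a, b, 9)
y = f(x)

# Trapezoid rule: areas of the trapezoids between neighbouring evaluations.
trap = 0.5 * np.sum(np.diff(x) * (y[1:] + y[:-1]))

# Posterior mean of the integral under the Wiener kernel (same formulas as above).
K = np.minimum.outer(x, x) - c
kmean = 0.5 * (x ** 2 - a ** 2) + x * (b - x) - (b - a) * c
bq = kmean @ np.linalg.solve(K, y)

print(trap, bq)   # identical up to round-off
```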
And similar to how you've learned in probabilistic machine learning that Gaussian process inference is maybe the right way to do least squares inference, because it's least squares as the posterior mean plus uncertainty estimates, and those uncertainty estimates come for free because they only require quantities that you've already computed for the posterior mean. So if you have them, why not use them to get uncertainty? In the same way, numerical computation in the classic sense should maybe be thought about, my personal opinion of course, as Gaussian inference, because it's least squares for numerical quantities with uncertainty for free. So now let's come back to this uncertainty that we've just computed and think about what we can do with it and why we should care about it. And here I said we should think about various properties of this plot. We'll first start with something, well, maybe uncontroversial, let's see. So in this plot, I've decided to use a regular grid. So from one step to the next, well, actually from one step of this plot to the next, I doubled the number of steps, because then it's a regular step in a log-log plot, but in each case I left the distance between the evaluations constant. So they are all equally far apart from each other. So you could wonder whether that's the right thing to do. Where should I put those evaluation nodes? And that's actually opening up an interesting aspect: numerical algorithms aren't just machine learning algorithms in the sense that they estimate from data. They are actually more like AI algorithms, more like autonomous agents, because they get to decide something. They don't just get data and then operate on it. They actually decide which computations to perform. It's in a somewhat boring way, right? It's just where do I put those little points at which to evaluate the function, but we still need to make this decision. And you can answer this question actually in the same way that you would answer it for a Bayesian inference problem. You could say: here is a probability distribution; I would like this probability distribution to converge towards certainty very fast. So I need to find a quantity that describes certainty or uncertainty and then manipulate that. So what's a measure of uncertainty, of information content, for a probability distribution? Entropy, yeah. So some of you are in Bob Williamson's class, I guess. So you've learned about entropy and you know that the entropy of a Gaussian is up here. The entropy of a multivariate Gaussian distribution is given by, basically, a constant that grows with the number of evaluations, plus one half times the log of the determinant of the covariance matrix. For Gaussian distributions that can be written as Kalman filters, so as linear time-invariant systems, there's a nice way to decompose this, which in our case turns out to give us this answer. So that is skipping a few lines of non-trivial algebra, but you can just believe me that it happens to be the case that the variance of this posterior distribution over the integral is given by this number. So actually, yeah, I made it sound more complicated than it is. This is a univariate Gaussian distribution, right? The integral, capital F, is the thing we don't know, and it's just one number. So the distribution over it is just a Gaussian bell curve. So it has one quantity, which is its variance, and that is this thing. And that's the thing we want to minimize, right? 
So this defines the entropy. If you want to make this entropy minimal, we need to minimize it. Well, actually the logarithm of this defines the entropy, but the logarithm is a monotonic transformation, so if we minimize the variance, we also minimize this logarithm, and the other way around, right? So, to minimize this expression: well, it's just a function of all the deltas. One first thing we can do is realize that maybe we want to have a design, so locations of the evaluations, that starts at the left end of the integral, at a, and ends at the right end of the integral, at b. So that tells us what delta n minus one, the last step, needs to be. It needs to be the width of the integration domain minus the sum of all the steps before. And then we have a function that now depends on all of the remaining delta i. We can take the gradient of this object with respect to these delta j, for j less than n minus one. You can do that for yourself. You get out this, because there's a cube here, the three comes down, squares are left, right, the twelve becomes a four, and we have those expressions. And now we want this gradient to be zero, because that's a necessary condition for a minimum. We set it to zero and we find that all of the delta j have to be equal to delta n minus one, so they all have to be equal. So if we want to minimize the posterior uncertainty over the integral under this kernel, and we assume that we want integration nodes on both ends of the integration domain, then we should use a regular grid. So this regular grid doesn't just come from intuition, it's not just what sort of looks nice, it actually can be motivated from this prior over the integrand. And now you might have two questions. I think Julia is already thinking about one of them. One of them might be: well, why do I need evaluations on the left and right end? You don't have to. You could leave the left end open, actually, or the right end. And that gives you a different design. If you really want to know, you need to check in my book, the answer is somewhere in the margins. It's an ever so slightly different answer which takes into account that we have this prior Gaussian process, which is broader on the right-hand side than on the left. So it's going to put a few more evaluations to the right than to the left, because there's more uncertainty there. But once you evaluate on the right once, that's gone, so it doesn't matter anymore. And a second question you could have, or maybe it's more of an observation, is that this is, of course, an answer that is very specific to this Gaussian process prior. So if I had some other prior that I can run this Bayesian quadrature algorithm with, then I might get other designs. In particular, if I have a structured prior uncertainty about the integrand, if I know something about it, that it has some particular algebraic shape or that the uncertainty is distributed in a non-trivial fashion across the domain, you might get quite different designs. So the technical takeaway is that we have now understood why we choose a regular grid, but maybe more importantly, we really should think about computation as the action of an intelligent agent. It doesn't have to be super intelligent, it only has to solve this simple problem, but you can still motivate how it acts from the point of view of Bayesian inference. 
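A quick numerical check of this argument, not from the lecture: among step-size choices that add up to the width of the domain and keep nodes at both ends, the equal spacing gives the smallest value of the posterior variance, the sum of delta cubed over twelve.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, n = -3.0, 3.0, 9            # n nodes -> n - 1 steps

def integral_variance(deltas):
    """Posterior variance of the integral under the Wiener prior: sum of delta^3 / 12."""
    return np.sum(deltas ** 3) / 12.0

regular = np.full(n - 1, (b - a) / (n - 1))
print("regular grid:", integral_variance(regular))

# A few random partitions of [a, b] with nodes at both ends, for comparison.
for _ in range(3):
    interior = np.sort(rng.uniform(a, b, size=n - 2))
    nodes = np.concatenate(([a], interior, [b]))
    print("random grid :", integral_variance(np.diff(nodes)))
```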
So there's a policy that this agent follows, like in reinforcement learning, and it is this very simple policy of maximum information gain. And here, in this case, we can do this in closed form and the answer is somewhat boring, it's just a regular grid, but there are other settings where this is actually a really non-trivial thing to do. So that is about where to put the evaluation nodes. So what we now know is that we can think of the trapezoid rule, this classic thing from high school, as a Bayesian estimate, actually the posterior mean of a Bayesian estimate, that arises from a very specific Gaussian process prior over the integrand, and if we decide to use a regular grid, then we can actually motivate this choice by an information-theoretic argument. So the entire algorithm is derived in a self-consistent fashion from a set of prior assumptions. And now, since we've realized that we are doing this, we can come back to this plot. Actually, I'll go back to this plot for a moment, sorry that I keep hopping around so much. Actually, maybe I'll go to this plot. And look at it again and realize, okay, this is kind of neat. So we can think about the trapezoid rule as a Bayesian estimate. Why is this good, first of all? Well, it means that on top of the estimate that we already know, this trapezoid rule, which is not new, and so this convergence rate here shouldn't really be surprising, because it's not a new algorithm, it's hundreds of years old, we get something else on top, which is this Gaussian process posterior variance, right? And the posterior variance is maybe an error estimate for the integral. So we get an error estimate, and we actually get it for free, basically, in the same way that the posterior covariance is free in a Gaussian process, because it only requires quantities that we've already computed anyway. For example, if you implement it as a filter, you just get it out as this P, the big P over the integral, at the end. So in this plot, I have actually plotted this error estimate, and it's this red dashed line up here. And that's a bit underwhelming, no? It's way too high. It's way above the actual error. And it's not even just above the actual error, it actually contracts at a rate that is less than the actual error. So the further we go to the right, the wider this gap becomes between the actual error and the estimated error. So we'll talk about that in a moment, but a natural reaction to this might be: oh, this whole algorithm is stupid, right? It estimates the error really badly. But keep in mind that this is the trapezoid rule, right? So if you think this is stupid, the trapezoid rule is stupid. And it's been around for, I don't know, 200 years. So it can't be so totally bad, right? So maybe there is something about this algorithm that isn't quite right, and if you can understand that, then we're not necessarily finding out that the Bayesian view is bad. We might just find out that the trapezoid rule has some properties that we might want to question. Okay, so let's do that. So we've already seen this variance estimate, I've now written it down a few times, well, actually on the previous slide, this is the variance estimate. So our error estimate, well, the variance is the expected square error, so our error estimate is the standard deviation, the square root of this. So let's write it down. It's this expression. 
We've decided that we're going to use a constant step size, because maximum information gain told us that was the best thing to do. So all the deltas are the same. Then what we have here is essentially a sum over n minus one terms, where each of the summands is just something like one over n cubed. More precisely, it's the width of the integration domain, b minus a, cubed, divided by n cubed, because if I set delta to a constant, that means that I have to cover the distance of b minus a in n steps. And that's exactly this, right? It's just b minus a cubed over n cubed. So that's nice, because it makes this sum easy to think about. These are all constant within the sum, there's no i in here, right? So what we have here is pretty much n, precisely n minus one, copies of this number. So one n cancels, basically, right? So it's one over n squared, or big O of one over n squared. And then there's a square root here. So that means this quantity drops like one over n. And you saw that in the plot, right? So the line dropped like one over n. That's actually what this does. But I already put into the plot that the error actually drops like one over n squared. And that's actually a theorem. So you can look that up, you have to dig around a little bit in some basic numerics classes, basically. It's often not even properly discussed because it's so straightforward. But the convergence rate for this trapezoid rule is big O of one over n squared. Why? One simple hand-wavy way to see this, without a proper proof, is that what we're doing is approximating our integrand by piecewise linear functions, right? So if you think about each of these regions where we have a straight line between the start and the endpoint, then within that region we have a linear interpolation of our function. But what we are actually estimating is the integral of that function. So that's like a quadratic approximation to the integral on that domain, because the integral is one order higher than the function we're approximating, right? So that means that the error, if you think in terms of Taylor series and Taylor's theorem, is something like big O of step size cubed, right? And there are n of those regions, each with an error that is cubic in the step size, so of order one over n cubed. We add up n of those and end up with one over n squared, basically. So that's the actual error. So that means our beautiful Gaussian process estimation rule somehow gets the error wrong by one order. It just assumes that the error is much worse. But there's a theorem that can tell it: that's your actual error, right? So why is it so overly cautious? There's a mathematical statement that can guarantee that this holds. Shouldn't it be possible to just, you know, make sure that this mathematical property holds? So, if you read this theorem, did you notice a hidden assumption in it? Yes? Yes, so the statement says: if little f is once continuously differentiable, then... But we are using a Wiener process prior. So this trapezoid rule arose by us saying, well, let's assume that our integrand comes from this Wiener process prior. So we did that. We sort of brushed past this. I just said, well, let's use whatever Gaussian process, you know, the first one that comes to mind, the Wiener process, or the Matérn one half, which is very similar. 
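A small experiment reproducing this gap (my own sketch, with scipy's adaptive quadrature as the reference value): the posterior standard deviation shrinks like one over n, while the actual trapezoid error shrinks like one over n squared.

```python
import numpy as np
from scipy.integrate import quad

def f(x):
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

a, b = -3.0, 3.0
F_ref, _ = quad(f, a, b)            # high-accuracy reference value

for n in [8, 16, 32, 64, 128]:
    x = np.linspace(a, b, n + 1)
    y = f(x)
    delta = (b - a) / n
    F_trap = delta * (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1])   # composite trapezoid rule
    sd_est = np.sqrt(n * delta ** 3 / 12.0)                       # posterior sd: O(1/n)
    print(n, abs(F_trap - F_ref), sd_est)                         # actual error: O(1/n^2)
```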
And then we get our algorithm, right? So in that process, we kind of forgot for a moment that, of course, by writing down a prior, you actually make an assumption. And it turns out, and this is a theorem you'd have to look up, that sample paths drawn from a Wiener process have this extremely rough behavior. They are non-differentiable almost everywhere. In fact, they can even be shown to not be Lipschitz. They are sort of almost Lipschitz, but not quite. So they are really, really nasty. They are what's called Hölder continuous of the lowest possible order. So they are super rough. If you zoom in, they have this self-similar, fractal kind of property: on every length scale they look the same, and always this rough. And so what this means is, we told our integration algorithm, in a sort of mean way: what's my purpose? Well, you integrate very rough functions. That's your job, right? I've told you, your prior assumption is that the function we're going to give you is unbelievably rough, non-differentiable almost everywhere. And then what this algorithm says is: well, okay, if that's the case, then I have to be really cautious with my estimate of what the integral is going to be. And what this thing does as we move forward is it keeps discovering that the numbers we get from this function are way more regular than it expected them to be. And it keeps making this observation at every resolution, because its prior is rough at every resolution and the function looks increasingly smooth as you zoom in closer and closer. So if you want to fix this, then we might need to use a different algorithm that assumes more smoothness. Or, another way to think about this: if you're using the trapezoid rule, you're throwing away computational efficiency, because you're using an algorithm that is designed to operate on super rough functions. And that means it can't be as good as it could possibly be if it knew that the function is much smoother. So without even talking about how you would do this, I'm going to get to that in a few minutes, there's actually something hidden in there as an insight about numerical algorithms. And it's more about the sort of social structure of numerical computation. So who actually writes numerical algorithms? Well, ideally numerical analysts, right? So a PhD student doing a math PhD or an applied math PhD. So what is their goal? Well, they often don't actually need to solve one particular problem. They want to have something they can publish, right? You want to come up with an algorithm that works on many, many problems. So you're kind of incentivized to come up with algorithms for which you can prove a statement that says: for every function that fulfills the following properties, this is going to work. And that naturally leads to overly general algorithms, ones where you're quite careful to make sure they work on a large class of problems. Maybe not always on super rough functions like this, but, you know, you're going to make some assumptions about the function that are typically quite cautious. And when you do that, this doesn't actually come for free. Algorithms that work on a large class of problems tend to not work particularly well on any individual one of them. 
And that means, if you have one concrete problem in your career at some point that you want to solve, and you know it's just this particular problem, then if it's sufficiently important to you, and not just something you need to get done in an afternoon, it might be worth actually designing one computational method for this particular problem. And you can do that once you understand how these algorithms work. We'll get to that again at the end of the lecture, to also think about what that says about Monte Carlo. Before we do that, though, I want to show that we can at least fix a little bit of this problem. So the simplest aspect of this miscalibration of the error is that there is also a scale to this error, which in this plot I've just set to one, actually. So this red region, it's just the standard Wiener process, right? It just opens up in this sort of standard fashion, with unit scale in the variance. And that of course isn't necessarily correct. We just happen to be lucky in this case, because the integrand we're looking at, this black line, happens to live on a scale from zero to one. So this plot looks good. But of course you could have a setting where the black line is much more dynamic, or much less dynamic, actually. This corresponds to the setting in simulation methods, in ODE solvers, where the dynamics of the vector field can be quite smooth or very rapidly changing, and then you maybe need different step sizes. So can we estimate that scale while the algorithm runs, and how expensive is that? The answer is yes, we can, and we can again use ideas from your probabilistic machine learning class. So, let me check if you can remember what you learned last term, if you took that lecture. If you have a Gaussian distribution with an unknown scale, an unknown constant in the covariance, and you get some data from it, how do you estimate that scale? Evidence optimization, so that's the generic, the general thing: if you want to learn a hyperparameter of a Gaussian distribution, you compute the evidence, the marginal probability of the observations under the prior, and then you optimize its parameters. That's true, that's the general answer. In general this is also expensive, because it's a non-linear optimization problem, but if it's just a scale, just a scalar number, then there's actually a closed-form way of doing this. Have you heard of conjugate prior inference? No? Okay, let me tell you the story; it's a party story everyone likes. If you haven't heard it yet, then you get to be one of today's lucky ones. So, okay, how do I do this in two or three minutes? For probability distributions that are exponential family distributions, so they can be written as the exponential of something linear in the sufficient statistics, there is always a conjugate prior. That is, a probability distribution over their parameters, in this case our parameter is going to be this theta in front of the kernel that we don't know, such that if you think of your observation as a likelihood and multiply it by this prior for the parameter, you get out an object that is of the same algebraic form as the prior. For Gaussian distributions, and the standard deviation of those Gaussian distributions, that conjugate prior happens to be what's called the gamma prior, the gamma probability distribution over the inverse of the variance, the precision of the Gaussian distribution. 
And so this is a complicated technical statement, but it's related to a nice story about this guy, called William Sealy Gosset, and the kind of people who know him by name tend to be statisticians. Has anyone heard of him before? No? Okay. So he studied statistics, if I remember correctly at Cambridge, and then he went on to become a brewer, a master brewer at the Guinness brewery, I think initially in London and then later in Dublin. And brewing back then was important business, and he had to make sure that this chemical process worked well, right? So he needed to take samples from the beer barrels and keep estimating how they are distributed, as samples from some distribution; he assumed a Gaussian distribution, and he had exactly this problem that he didn't know what the variance of his sample was. But he had studied with Pearson, I think, for his degree, and had learned a little bit about how to do statistics, and he came up with this beautiful scheme that nowadays we write like this: a gamma prior over the inverse variance of the Gaussian, and then conjugate prior inference. But he was working for Guinness, and Guinness didn't allow him to publish papers. So it's a little bit like today, where sometimes people working in industry aren't allowed to publish their work, because Guinness wanted to keep it a bit of a trade secret. So he published under a pseudonym; he called himself Student, and so the distribution that he came up with, which is the marginal, the integral over the Gaussian times the gamma, integrated over theta, is called the Student distribution, because he couldn't use his proper name. It's often called the Student t distribution because he defined a statistic called t, so it's one of these distributions that are, somewhat stupidly, named after a completely arbitrary variable, just like the gamma distribution. So, in numbers, this means that to keep track of our estimate for theta, we start with a prior over theta to the minus two which is of this particular algebraic form. I haven't even written out what it actually looks like; you can look up on Wikipedia what the gamma distribution is. It just happens to be the right algebraic choice such that the following trick works. If you multiply those two together, so a Gaussian distribution over some numbers, with a variance given by theta squared times something known, multiplied by this thing, then the product, up to normalization, is again of the same form as this gamma distribution, with new parameters. And those parameters are the original parameter plus the number of collected variables times a half, and the original other parameter plus a statistic, also called the sufficient statistic, that is given by this object: the observations, the function values, transposed, times the inverse of the kernel Gram matrix, times the observations. The only really important thing about this is that, A, this is a quantity we have already computed anyway, because we need it for our posterior mean, so there's no extra computation to get it. Nice. And secondly, the other quantity is also easy: we just keep track of how many numbers we've collected, so we already know that as well. So we can keep track of these two objects simply by adding up numbers that we locally collect, pretty much for free. The only thing we need to do is add up two floating point numbers, which costs basically nothing. 
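A sketch of this conjugate update for the kernel scale, using the Wiener kernel's Gram matrix; the prior parameters alpha and beta here are placeholders I chose, not values from the slides.

```python
import numpy as np

def f(x):
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

a, b, c = -3.0, 3.0, -3.5           # c strictly below the smallest node

X = np.linspace(a, b, 17)
y = f(X)
K = np.minimum.outer(X, X) - c      # Wiener kernel Gram matrix (unit scale)

alpha0, beta0 = 1.0, 1.0            # placeholder Gamma prior on the precision 1 / theta^2
alpha_n = alpha0 + len(X) / 2.0                     # + number of observations / 2
beta_n = beta0 + 0.5 * y @ np.linalg.solve(K, y)    # + the sufficient statistic y^T K^-1 y / 2

theta2_hat = beta_n / alpha_n       # inverse of the posterior mean precision: a point estimate of theta^2
print(theta2_hat)
```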
So if we keep doing that, then we always have a posterior estimate of what theta is. It starts out very broad and then becomes more and more concentrated as we get more observations. The associated distribution over this parameter is the gamma distribution, and the associated distribution over the integral is the Student-t distribution. If we do that, then we get the following plot. Here is the initial prior again, and now at every point in time I show you, in thin gray in the background, what we would have got if we had kept the scale constant, and in red the actual estimate with the Student-t marginal. And you see that initially we get an observation that's a little bit too small, this number here and this number are a little closer to zero than what the prior expected, so we shrink a little. Then we get the next few observations, and now the distribution actually looks quite good, because at this resolution the steps from one point to the next are roughly what the prior expects, so we are nearly at the right scale. Now we keep going and keep observing, and we're beginning to get into this domain where the function happens to be quite regular. We get higher and higher resolution observations of the function, and we keep seeing that the values from one node to the next don't deviate very far from each other, and so this distribution starts to contract. On the right, you see the original estimate from the previous plots, and the new one in dashed red, and you can see that it's still not super convincing, but it's a little bit better. It's actually quite a lot better in the sense that, asymptotically, this contraction rate will match this contraction rate. So they get to the right order, but this gap is not going to close. And what this keeps telling us is that we're using a prior that is way too conservative for this problem. We know that this function is smooth, and in fact we really do know. If you started this lecture by looking at this expression, f of x equals exp of minus sine squared of three x minus x squared, you could have looked at it and just said: well, that's a smooth function. There's a sine in there, there's an x squared in there, there's no point where it diverges, no point where the gradient is undefined. It's a smooth function. You could also run automatic differentiation on it, if you wanted to do this with a computer, and keep discovering that it's perfectly smooth. So we actually know. This isn't like your prior in statistical machine learning, right, or Bayesian machine learning, where you just go in and say: I don't know, the world is complicated, I'm going to use a smooth prior because what else am I going to use? Here in computation, we actually know what the problem is. It is given to us as a string in a formal language that describes the problem precisely. So we can check certain kinds of prior assumptions, for example the fact that all derivatives of this function exist. And if we can do that, we might want to use it and break away from the trapezoid rule to use something smarter. And in fact, it turns out there are, of course, much smarter algorithms. If you have used an integration method on univariate problems like this before, you might know that there are nice packages for integration. In Python, you find them in scipy.integrate; quad, for quadrature. And because they are all least-squares estimates, they all correspond to some Gaussian process prior.
In particular, there's a class of algorithms called Gaussian quadrature. They often come as Gauss-something quadrature: Gauss-Legendre, Gauss-Chebyshev, Gauss-Lobatto, and so on. They all correspond to some least-squares estimate under a particularly structured prior. I don't have time to tell you exactly how they work, but I can show you that they work really well. So here on the left you see, in green, a particular set of features in which it's a good idea to expand these functions. These features happen to be Legendre polynomials. The Legendre polynomials are a very interesting class of polynomials: they are smooth, so always infinitely often differentiable; they are orthogonal to each other, which makes certain kinds of computations very fast and very convenient, you can think of some Gram matrix being diagonal; and they are defined on a bounded domain from minus one to plus one, which of course you can rescale to minus three to plus three by just taking all the numbers that come in and dividing them by three. So they are a classic way of doing certain kinds of integrals on bounded domains that are multiples of minus one to plus one, like ours. And there are corresponding quadrature rules, which I'll tell you more about in a moment, that converge extremely fast. Here on the right, in green, you see the error of the corresponding integration estimate, and you can see it converges very, very fast. In fact, as a suggestive line, without a proof or any formal statement, I've added this bent line, which is an exponential function, just to point out that this is so fast you can almost think of it as exponential convergence. It's a super-polynomial rate of convergence. This only works, though, this nice convergence you only get, for functions that are actually smooth. But our function is smooth, so that's fine. And it turns out that this estimate, this green line you see here, also corresponds to a posterior mean estimate under certain kinds of Gaussian prior assumptions. The paper on this was written by Toni Karvonen and Simo Särkkä in 2017. I've got the abstract down here, which I'm actually going to read out because it contains a few warnings. It's a bit small, so in case you can't read it, this is the actual abstract from their paper. In an extension of some previous work on the topic, we show how all classical polynomial-based quadrature rules, this green one is one example of them, can be interpreted as Bayesian quadrature, the stuff we've been talking about today, if the covariance kernel is selected suitably, so if we find the right prior. As the resulting Bayesian quadrature rules have zero posterior integral variance, the results of this article are mostly of theoretical interest in clarifying the relationship between the two different approaches to numerical integration. So what they actually found is that to make this work, to get this green line, you have to do something that seems quite unnatural: you have to ask the algorithm, how many evaluations are you going to use? And then I'm going to give you a kernel that is fitted specifically to this number of evaluations, which happens to have the property that exactly after that number of evaluations, the posterior variance over the integral is zero, even though it's wrong. And if you then wanted to add an uncertainty, you would have to add more terms afterwards, which would mess up the posterior mean estimate.
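To give a feeling for how little work this takes in practice, here is a small sketch using NumPy's built-in Gauss-Legendre nodes and weights on the running example; the nodes live on minus one to plus one, so we rescale them to minus three to plus three exactly as described above (the specific orders chosen here are arbitrary):

```python
import numpy as np

def f(x):
    # the running example: exp(-sin^2(3x) - x^2)
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

# Gauss-Legendre nodes and weights are defined on [-1, 1]; stretching the
# nodes by 3 and scaling the weights by 3 turns them into a rule on [-3, 3].
for n in (3, 5, 10, 20):
    nodes, weights = np.polynomial.legendre.leggauss(n)
    estimate = 3.0 * np.sum(weights * f(3.0 * nodes))
    print(n, estimate)
# Watch how quickly the digits stabilize as n grows -- this is the fast
# convergence the green curve in the plot is showing.
```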
You can see that this is actually a problem in this plot. What I've done here on the left is, in red, I've computed one such estimate of the integrand. That's this red line that keeps going up and down. This is for five evaluations; here you see the five evaluation points, one, two, three, four, five. You can see that those points are irregularly spaced. Well, they're somewhat regular, but they're not at uniform distance. That's because this is a non-trivial Gaussian prior. And you can also see that the interpolant looks absolutely nasty. It doesn't look anything like the black line, and it diverges badly to the left and the right. But it happens to be quite a decent integral estimate. The integral over this red line happens to be a point on this green curve. So this is really a set of features to expand in which is not particularly suitable for learning the function, but quite suitable for learning the integral. They have this nice convergence rate. If you really wanted to understand this better, that would be, I don't know, a nice master's thesis on the side, because it's not really a solved problem, in the sense that it doesn't provide super good uncertainty quantification. But I'm showing you this plot to point out that there are classic algorithms which are very good for this kind of problem. And they actually come with lots of structure, with different bases for different kinds of integration problems. And since we're talking about integration today, you might as well know about them. So if you're using one of these quadrature rules that come with scipy, then there are actually several different ones to choose from, and they are designed to work on particular kinds of problems, depending on the integration domain and the shape of the function you're trying to integrate. What they do is approximate the integral, this thing we care about. Well, they assume that it can be written like this, where phi is a set of orthogonal polynomials and w is some weighting function. And then there are specific choices of phi depending on w, and on a and b, the integration boundaries. If the integration domain goes from minus one to one, like in our case, then there's in particular the one I used here, Gauss-Legendre integration. And then there's actually a whole set of different polynomials by Chebyshev. In particular there's this one, but there's more than that; there are at least three different kinds of Chebyshev polynomials, for different kinds of weightings. They are basically a statement about where the function varies, right, whether you expect it to vary more in the corners or more in the middle. And there are also variants for indefinite integrals, or integrals over unbounded domains, which use Hermite polynomials; and there are various other ones, Gegenbauer polynomials and so on. So if you ever have the need to compute an integral numerically on the line, a univariate one, then think briefly about what kind of integral you're trying to compute. What is the integration domain? What's the shape of your function? Why? Because you're specifying a prior for your Bayesian inference rule. And if you get it right, you need fewer numbers. And if your problem is particularly nasty, you might really have to think about it, because otherwise the algorithm might not work well. Thankfully, because this is a simple problem, integration in one dimension, these algorithms are quite good at telling you when they are probably wrong.
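If you just want the number and an error flag, the adaptive routine in SciPy wraps this kind of machinery for you; a minimal example on our integrand (quad returns the estimate together with an estimate of the absolute error, and warns if it thinks the result is unreliable):

```python
import numpy as np
from scipy import integrate

def f(x):
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

# quad is an adaptive scheme built on Gauss-Kronrod rules (QUADPACK); besides
# the value it returns an estimate of the absolute error of the result.
value, error_estimate = integrate.quad(f, -3.0, 3.0)
print(value, error_estimate)
```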
So they have good convergence estimates, and they throw errors and flags and so on. Okay, yes? You mean the fact that the red line, that this red line here has this bend and then becomes like this? So, well, it has a reason, and it has something to do with this very specific integrand. What you see here is maybe the fact that this integrand has this global structure with a period of pi over three, because it's the sine of three x, right? So it wobbles up and down. And beyond that, as you zoom in more and more, it basically is just a straight line in every region. And this first region here is where the algorithm discovers that global shape, basically. And after that, it does something very boring: it just keeps extrapolating straight lines by straight lines. But they are ever so slightly not straight, and they are not straight in basically the same way on every scale. That's what you see here. So this is something about the particular problem that we're trying to integrate. So now, for the final few minutes, let me briefly summarize. This is maybe a point where I should have had a slide which says: there are these classic integration algorithms, classic quadrature rules, these Gaussian quadrature algorithms, Gauss-Legendre, Gauss-Chebyshev, Gauss-Kronrod, Gauss-Gegenbauer, Gauss-Hermite and so on, and also the trapezoid rule and various other integration rules. Pretty much every integration algorithm for one-dimensional functions that you've seen before, in high school or in a numerics class, can be thought of as some form of Gaussian posterior mean estimate, because they are derived as least-squares estimators. And we can motivate choices made by these algorithms, in particular also where they put their evaluation nodes, by information-theoretic arguments, by minimizing the entropy of the posterior distribution. Actually, in the case of these orthogonal polynomial integration rules, this is particularly extreme: there, the evaluation nodes are chosen such that the posterior variance is zero, so that the entropy goes to zero, or basically becomes undefined, it's completely removed. Right, but that's just one-dimensional functions. So what we did so far was mostly think about integration as some kind of philosophical problem, right? There's a function, you want to interpolate it, learn its integral. But who really cares about one-dimensional integrals? In reality, if you actually encounter a one-dimensional integral, you now know what to do: you go to scipy.integrate.quad and it just does it. And unless your integrand has a super nasty shape, or it's on an unbounded domain with nasty shapes in certain regions, it's probably going to work. But the interesting settings are higher-dimensional problems. Beyond dimension one, well, actually there are also classic quadrature rules for dimensions two and three, they're called cubature rules, but beyond three, things get really complicated. And I told you last week, actually, I have a slide for this, right? Aha, I showed you this: that Monte Carlo algorithms converge like one over the square root of the number of samples, and that's in some sense good and in some sense bad. Bad because one over square root looks bad, but good because in that expression there was no obvious way in which the dimensionality of the problem showed up. You can construct situations where dimensionality matters, and in many cases it doesn't.
So the methods I showed you today actually do not scale well with dimensionality. They tend to have the property that their performance decays exponentially with the dimensionality. Why? Well, because you pretty much have to produce a mesh of evaluations such that, in each dimension, the distance between the points, in the Euclidean sense, is smaller than some number. And to make that number small, that typical scale, on a grid in D dimensions you need a number of points that is exponential in D, right? If you have a one-dimensional grid and you want 10 points between the left and right end, you need 10 points. If you have a two-dimensional grid and the projection onto each of those dimensions should have 10 points, then you need something like 100 points, 10 to the two. And if you have a D-dimensional grid, you need something like 10 to the D points. So that's bad. This has something to do with the fact that we're trying to interpolate the function on the whole domain. It also has something to do with the fact that we're using Gaussian processes as our priors. Remember that the Gaussian process posterior covariance looks like this, k of a and b minus k of a and X times the inverse Gram matrix times k of X and b, and that's the thing we're integrating. Notice that this expression, in its nice linear algebra form, doesn't contain any y's; there are just x's in there. So the posterior uncertainty of a Gaussian doesn't depend on the numbers you've seen, it only depends on where you have evaluated. And that means these integration methods are, by their nature, non-adaptive. They can't reason about where to put evaluation points, because there is nothing in their entropy, in their formulation of information gain, that tells them that some points contain more information than others. And that means they scale badly to higher-dimensional settings, because in higher dimensions they just have to cover the domain with points. So to get away from this, we need algorithms that can somehow reason about the fact that some points are more informative than others. And that is actually often the case for integrals, specifically in machine learning, because, actually, yeah, I have the right slide for this. In our machine learning settings, I said last week as well, the integrals we tend to compute tend to be expected values of some function under some probability distribution. That's actually the reason why Monte Carlo works in the first place. For Monte Carlo we always have to compute some integral of f of x dp of x, where p is a probability distribution. And probability distributions integrate to one. So on a big domain they can't take large, non-negative values everywhere, because then they wouldn't integrate to one. They typically have what's called concentration of measure: they tend to be large in some area and then almost zero everywhere else. And the example we've been looking at today, for the entire day, is of this form as well. So we can think of this function, and here's the plot that I might have shown you in the beginning: our f of x is equal to exp of minus sine squared of three x minus x squared. We can think of this as a product between the Gaussian, exp of minus x squared, and this exp of minus sine squared of three x function. If you multiply those two together, these two gray lines, you get the black line. So this is a little bit like, well, actually it's a prototypical example of a situation where you have a prior, in this case the prior is a Gaussian distribution.
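This independence from the observed values is easy to see numerically; here is a small sketch with a hand-rolled squared-exponential kernel (my choice, purely for illustration) showing that the Gaussian process posterior variance is computed from the evaluation locations alone:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    # squared-exponential kernel between two 1-d arrays of inputs
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

X = np.array([-2.0, -0.5, 1.0, 2.5])    # where we evaluated the integrand
x_star = np.linspace(-3, 3, 7)          # where we ask for posterior uncertainty

K = rbf(X, X) + 1e-10 * np.eye(len(X))  # small jitter for numerical stability
k_star = rbf(x_star, X)

# Posterior covariance: k(x*, x*) - k(x*, X) K^{-1} k(X, x*).  No y in sight.
post_var = np.diag(rbf(x_star, x_star) - k_star @ np.linalg.solve(K, k_star.T))
print(post_var)
# Whatever function values we happen to observe at X, this stays the same;
# only the locations X enter, which is exactly why the rule cannot adapt.
```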
Multiply it by some likelihood function, which in this case is this exponential of minus sine squared function. And you know that likelihoods don't have to be probability distributions in their parameter. So if you think of L of x as a likelihood, the probability of observing some data given x, then this L doesn't have to be a probability distribution in x, only in the data. But the posterior, pi, has to be a probability distribution in x after normalization. And that's the situation we have here. So in a situation like this, if I asked you for the integral over this curve, you can sort of intuitively say: well, first of all I know that this function is non-zero everywhere, because it's a probability distribution. Sorry, not non-zero, non-negative, so it's larger than or equal to zero. And secondly, it can only be large in some contained area. If there were a possibility for it to come back up again far away, then overall it would have to be lower, right? And that means if I start somewhere, let's say here, and keep seeing large values, then that might be a good reason to stay in this region for a while. And that's exactly what Markov chain Monte Carlo does, right? It moves around, and if it finds larger and larger values, it spends more time there. So we need an algorithm that does something like this. And one first problem that we have with our Gaussian processes so far, maybe the first thing we could fix, is this: notice that Gaussian distributions are distributions over the real line, so they always put non-zero probability on negative function values. They have to, because at every point in x there's a Gaussian distribution over the function value, and the Gaussian has full support, right? So they have to allow for function values that dip below zero. But we know that this function is larger than or equal to zero everywhere; actually this one is even strictly larger than zero everywhere. So we would like to encode that, and you can't encode it with a Gaussian. One way to encode it, in a sort of pedestrian fashion, is to say: let's take a Gaussian process and then square it. So we take functions that come from a Gaussian process and take their square, because the square of a real number is always larger than or equal to zero. So here I've actually done that, or something almost like it. I have defined a new function, sorry, a new function L tilde, which is implicitly defined through this equation, or explicitly through this one: it's given by the square root of two times the difference between the actual function L that we're trying to regress on and some constant. The constant is just something that captures the fact that this function doesn't come down to zero. Here I've set it to 0.1, but we could have set it to 0.4 as well, whatever. So that's a way of saying that this function L, this part of our integrand that we care about, is larger than zero everywhere. And we can't do closed-form Gaussian inference on this, but we can do something almost like it. So I can take, and here you have to trust me that it's just an intuitive idea: whenever I run the algorithm, I take the values from our gray line here, L, and because I can evaluate them without noise, it's just a function I can evaluate on the computer, I can evaluate the corresponding value of L tilde. It's just this computation, which is cheap. And then I just use this as the training data for my approximate Gaussian process.
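Just to make the warping concrete, here is a tiny sketch of how the training data for the approximate Gaussian process would be produced under the square-root transform described above; the offset value and evaluation locations are arbitrary choices for illustration:

```python
import numpy as np

def L(x):
    # the non-Gaussian factor of the integrand, bounded below by 1/e > 0
    return np.exp(-np.sin(3 * x) ** 2)

alpha = 0.1                    # small constant offset, as in the lecture
X = np.linspace(-3, 3, 9)      # wherever the algorithm decides to evaluate

# Noise-free transformed observations: L_tilde = sqrt(2 * (L - alpha)),
# so that L = alpha + 0.5 * L_tilde^2 is non-negative by construction.
L_tilde = np.sqrt(2.0 * (L(X) - alpha))
print(L_tilde)
# These L_tilde values are what the Gaussian process is trained on; pushing its
# posterior mean m(x) back through the warp gives roughly alpha + 0.5 * m(x)^2
# as a non-negative model of L, which is then integrated against exp(-x^2).
```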
And then, when I reason about the thing I care about, L, I have to do some kind of numerical trick to reason about what the corresponding squared function would look like. There are various numerical tricks for doing that approximately. And the other thing I can do is say: I know that I don't actually care about L by itself, I care about the integral of L under this Gaussian prior, exp of minus x squared. So I just multiply these two things together to get a belief over what the function I'm trying to integrate might look like. This is called warped Gaussian process inference, because we warp the Gaussian process, the samples from the Gaussian process, by squaring them. The corresponding approximate distribution over those function values, over these orange lines, is not a Gaussian distribution. I mean, the correct distribution is not Gaussian, and the approximation to it is some stochastic process that can be approximated by a Gaussian process whose covariance depends on the observed function values; it actually depends on the posterior mean. That can be implemented, and it gives rise to an algorithm called WSABI, for warped sequential active Bayesian integration. You don't need to understand exactly how it works, but the basic idea, the final thing I'm going to tell you, is that it's possible to build approximate numerical algorithms, based on a probabilistic formulation, which adaptively assign more uncertainty where they think they're going to see large function values. And here's the plot of what this looks like, without understanding what the math actually is. We start out with our approximate distribution over this L, evaluated once at some first point. In this case the point is at zero, because we have a Gaussian prior that says the mass lies in this bell-shaped region, so we might as well start at the mean; it's a good point to start. We evaluate, we get one point, we get an interpolant for this function. And then the posterior uncertainty over those function values, thanks to this not-further-explained approximation of a nonlinear transformation of a Gaussian, gives rise to this utility function in blue, which is some form of uncertainty measure. And you can see that it adapts to the shape of the observed function. What it says is: I have seen a function value here, so I'm not uncertain at that point. But right next to it, I believe there are larger values, and I care about large values because I know the function is typically quite small. So in this region, close to what I've already seen, I might want to evaluate again. So wherever this function is maximized, which happens to be this line, I evaluate again, and then I become certain around that point. Now I shouldn't be looking here anymore, because I know the function here quite well, but over here I'm still quite uncertain, so that's where I choose to evaluate next. And you notice how this is like an active learning procedure, right? You keep observing, you adapt to what you've seen, and this algorithm will actually start to explore and build up its model: now it has observed here, now it bends in there, starts filling in this bit, and slowly it has seen the large part of the integration domain, it has covered it. And once it has covered down to here, it will start coming back in and refining this estimate, moving around.
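To make the loop tangible, here is a deliberately crude, hand-rolled sketch of such an active integration scheme on the running example. It is not the actual WSABI algorithm: the kernel, its length scale, the linearised acquisition function and the final grid-based integral are all simplifications chosen for illustration, but the evaluate-adapt-evaluate pattern is the one described above.

```python
import numpy as np

def L(x):
    return np.exp(-np.sin(3 * x) ** 2)       # the likelihood-like factor

def prior(x):
    return np.exp(-x ** 2)                   # the Gaussian factor exp(-x^2)

def rbf(a, b, ell=0.6):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

alpha = 0.1
warp = lambda l: np.sqrt(2.0 * (l - alpha))  # square-root warping of observations

grid = np.linspace(-3.0, 3.0, 301)
X = np.array([0.0])                          # start at the mode of the prior
y = warp(L(X))

for step in range(8):
    K = rbf(X, X) + 1e-8 * np.eye(len(X))
    k_star = rbf(grid, X)
    m = k_star @ np.linalg.solve(K, y)                          # GP mean of L_tilde
    v = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star.T).T, axis=1)
    v = np.clip(v, 0.0, None)                                   # GP variance of L_tilde

    # Roughly, the uncertainty about L itself scales like m^2 * v; weighting by
    # the prior focuses the search on regions that matter for the integral.
    x_next = grid[np.argmax(prior(grid) * m ** 2 * v)]
    X = np.append(X, x_next)
    y = np.append(y, warp(L(np.array([x_next]))))

# Crude integral estimate from the warped posterior mean on the grid.
K = rbf(X, X) + 1e-8 * np.eye(len(X))
m = rbf(grid, X) @ np.linalg.solve(K, y)
estimate = np.sum(prior(grid) * (alpha + 0.5 * m ** 2)) * (grid[1] - grid[0])
print(np.sort(np.round(X, 2)), estimate)
```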
Such algorithms, well, this particular algorithm, don't just work for one-dimensional problems; it actually works in higher dimensions as well. And it can even be shown to have a non-trivial convergence rate. There was a complicated paper at NeurIPS 2019 pointing out that such adaptive quadrature rules can have a convergence rate that isn't totally stupid. So this particular algorithm I just showed you, even though it was constructed very hand-wavily, by just saying now we're going to build this thing, can actually be shown to have properties that are not so bad. That's the most I'm going to say, it's a very complicated theorem, and one can argue about whether this is even a good rate or not, but they are consistent algorithms that actually converge. So, long story short, it's possible to build algorithms that adapt their behavior in the same way, well, not the same way, but in an analogous way to how Markov chain Monte Carlo does it. And these algorithms scale to problems of reasonable dimensionality, 10 or 15. If you have 2,000 dimensions, well, then you're not going to cover them anyway, and then you can use Markov chain Monte Carlo, and it's going to be roughly as good as these methods, because there's no structure to use. You would need something like 10 to the 1000 samples to even roughly cover the space, and Monte Carlo is not going to do that either. But actually, after Christmas, in around late January, we'll find out that in such settings there's something even better than Monte Carlo to do, which will not necessarily have anything to do with Bayesian quadrature. The high-level philosophical statement for that, and here I'm going to cite my wonderful co-author Mike Osborne, is that one should be careful about why Monte Carlo is actually chosen. Monte Carlo algorithms can also be associated with a Gaussian process prior; there's a theorem for this, which I'm going to spare you because I'm out of time. But the point is that the argument for Monte Carlo is that its convergence rate is independent of dimensionality, while that convergence rate is actually very bad: it's one over the square root of the number of samples. And we have to use this convergence rate because we insist on having an unbiased estimate asymptotically. If you're willing to give up unbiasedness, which last week I argued is artificial anyway, because it's only a meaningful concept once you use random numbers, and there's no need to use random numbers for a deterministic problem, then you can actually get algorithms that converge way, way faster in low dimensions, even super-polynomially fast, extremely efficiently. So, as Mike says, if you care about the fact that the Monte Carlo convergence rate is independent of dimensionality, that is like saying that if you want your nail hammering to be independent of wall hardness, you should do your hammering with a banana. If you use a really bad tool, then that tool is going to work equally badly on every problem, but that's not a good property to have. Actually, the fact that Monte Carlo is independent of the problem shows that it uses a way too weak set of prior assumptions, and if you want an algorithm that works well, you typically have to inject some prior knowledge to make it work well. And that's not just true for integration, it's also true for other kinds of numerical problems. The more you know about your problem, the more you can inject.
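As a final, purely illustrative comparison in this easy one-dimensional setting, here is a sketch pitting plain Monte Carlo against Gauss-Legendre quadrature with the same number of function evaluations; the reference value is itself just a very high-order quadrature estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)

# Treat a very high-order Gauss-Legendre estimate as the reference value.
nodes, weights = np.polynomial.legendre.leggauss(200)
reference = 3.0 * np.sum(weights * f(3.0 * nodes))

for n in (10, 100, 1000):
    # Plain Monte Carlo: uniform samples on [-3, 3]; error shrinks like 1/sqrt(n).
    samples = rng.uniform(-3.0, 3.0, size=n)
    mc_error = abs(6.0 * np.mean(f(samples)) - reference)

    # Gauss-Legendre with the same budget of evaluations.
    nd, wt = np.polynomial.legendre.leggauss(n)
    gl_error = abs(3.0 * np.sum(wt * f(3.0 * nd)) - reference)

    print(n, mc_error, gl_error)
```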
Yeah, and we'll talk more about this after Christmas. At this point I'll end. I've used integration as an abstract, philosophical example to think about what it means to compute something. We spoke about one concrete, particular problem and realized that you can think of the computation in a Bayesian fashion, by putting a prior distribution over the quantities you care about, in this case the integral, and the things you can compute, in this case the values of the integrand. The corresponding algorithms produce a posterior point estimate that can be identified with classic methods, but they also assign an uncertainty estimate around those point estimates, which arises from the prior assumptions and the numbers you got to see. Sometimes those error estimates are really bad, very far away from the truth, and when that's the case, it doesn't necessarily mean that the probabilistic view on computation is wrong; it typically just means that the set of prior assumptions is bad, and that you can maybe fix it by using more prior knowledge. Such prior knowledge is often available in computational settings, because we know what our problem is; it's encoded in computer code for us, so we can look at the formal description that the code provides and try to use it to extract more information and make the algorithms run faster. And we can think of these numerical algorithms as little autonomous agents. If you're building a big machine learning solution, some agent that, I don't know, drives a car, makes some complicated decision, then whenever it solves numerical problems, whether it's training a deep neural network or computing some Bayesian posterior, somewhere inside there is a tiny little autonomous agent inside of your big autonomous agent that does the computation, that decides which numbers to compute, and by informing that autonomous agent about its task, you can make it work better. And maybe that's where we end for today. Thank you.