All right, welcome everybody. Welcome to the fourth lecture on numerics of machine learning. Last time, we took a closer look at iterative approximation algorithms for Gaussian processes. Today is the culmination of this first sequence on Gaussian processes: today, you'll see how to do exact uncertainty quantification in arbitrary time. So no matter how much computation you put in, whatever your budget is, you can still quantify how far away you are from the function that you're trying to learn. Okay, so we'll take a look at that. And today will be the first real lecture on probabilistic numerics, actually. So last week, we saw kind of the contemporary way of solving linear systems in the large-scale regime, specifically applied to Gaussian processes. We saw how GPyTorch works internally, a very popular library for Gaussian processes in Python. And today, we'll put our probabilistic numerics hats on and reinterpret the algorithms that we saw last time as learning agents. And we'll see a very concrete benefit of this view: we can also quantify the approximation error. Since we're approximating, undoubtedly there will be some error introduced into our posterior, into our prediction. And we'll see that with this view, we can actually keep track of this error exactly, okay? We'll also see a very interesting, almost philosophical debate about what it means to observe data about some latent phenomenon. Let's say you're trying to learn some weather pattern. You model it with a Gaussian process. You want to predict what temperature the ocean will have in a certain time frame in the next few months. You want uncertainty around your prediction. And we'll see what it means if you now observe this data, through maybe buoys in the ocean, make some measurements, store that on some hard drive, and then try to make a prediction with a computer.
In what sense is that data on your hard drive any different from computing on that data? So we'll discuss that a little bit. Okay, so let's get into it. We'll start with a quick recap of what we talked about last time, because much of that will reappear in some way today. As last time, whenever you have any questions, simply raise your hand or just shout, and I'll try to answer them to the best of my ability. Okay, so we're still on this problem. Gaussian process regression is kind of a prototypical example of a regression problem: trying to learn a function f from R^d to R, where we have some observation noise of size sigma squared, and we can compute a GP posterior that quantifies our prediction and how certain we should be in that prediction. And as we saw, there's a very expensive operation in there: a matrix inverse, which has cubic complexity in the size of the data set. What did we see? Well, we saw you can use iterative algorithms, algorithms that do one matrix-vector multiply per iteration, to compute an approximation to the precision matrix, so the inverse of the kernel matrix. And that matrix is actually low rank, exactly of rank equal to the number of iterations that you've done. Here at the bottom is one example, the partial Cholesky, where you select a bunch of data points through so-called actions. So here's the Cholesky as an algorithm again. Let's recap the most important elements of it. In the second lecture, we saw that the Cholesky is kind of the prototypical algorithm to invert or decompose a symmetric positive definite matrix. We saw last time that we can also interpret it as an approximation algorithm for the inverse of the matrix. So it builds up this low-rank matrix C_i as an approximation to the inverse. And we did a rough complexity analysis: it costs n squared per iteration. So if we do a small number of iterations, we can do this in quadratic time. And there was one specific, kind of important choice.
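To make the recap concrete, here is a minimal sketch of the partial Cholesky viewed as an inverse approximation, in the rank-one-update form from last lecture. The function name and structure are illustrative, not GPyTorch's actual API: each unit-vector action selects a data point, costs one matrix-vector multiply with K, and adds one rank-one term to the approximation C_i.

```python
import numpy as np

def partial_cholesky_inverse(K, indices):
    """Low-rank approximation C_i of K^{-1} built from unit-vector actions.

    Each iteration costs one matrix-vector product with K (O(n^2));
    after i iterations, C has rank exactly i.
    """
    n = K.shape[0]
    C = np.zeros((n, n))
    for j in indices:
        s = np.zeros(n)
        s[j] = 1.0                      # unit-vector action: select data point j
        d = s - C @ (K @ s)             # direction not yet explained by C
        eta = d @ (K @ s)               # scalar normalization
        C = C + np.outer(d, d) / eta    # rank-one update to the inverse estimate
    return C
```

After all n unit actions (in exact arithmetic), C equals K^{-1}; stopping early gives the quadratic-cost, rank-i approximation discussed above.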
This choice of the action, which is a unit vector that selects a certain data point. So if I go one slide back here, in the first three iterations, we selected these three data points, okay. So at a high level, I use a lot of language that refers to learning problems, taking an action, making an observation, about this numerical agent. And today we'll make that precise: how is this actually really a learning machine? Okay, we also saw that it really matters which actions you choose, right? So here on the left, the actions target the first 10 data points from the left. And you can see that the marginal variance, so how certain we are in our prediction, here in dark or red shading, versus the exact GP posterior computed with O(n cubed) operations, only reduces your uncertainty very locally in this region. But I can do better if I, for example, just choose my actions randomly. If I select randomly chosen data points, roughly spread out through this data set, I actually reduce my uncertainty about what the true function is much, much quicker. There's certainly a lot more reduction happening here, even though it costs exactly the same. So the choice of action matters, in the same way that it does for a learning problem. It really matters where you observe data. The more new data you observe that's not similar to the data that you've seen, the more you learn about the function or the phenomenon that you're trying to predict. And we're seeing very similar things here for this numerical method. Okay, so the natural follow-up to this is: well, can we find better actions, right? Do we have to select data points? Or can we do something different, something smarter, maybe? On the left here, for the partial Cholesky, this is the residual matrix: the part of the kernel matrix that we haven't learned yet. When we choose unit actions, there's just a single one in this vector and the rest is all zero.
And here we've observed in this part of the matrix in the middle, and then down here, where there are still large values in the residual, we haven't really learned what the kernel matrix looks like. And then I showed you that there's actually a different strategy where I choose dense actions, so vectors that have some number in every single component, with which I can more quickly learn the kernel matrix. The residual is much smaller, right? They're on the same scale, color-wise. So how would we choose these actions? What actions can we choose that do this better? And the method that we came up with, or that we looked at, was the method of conjugate gradients. There's one important conceptual switch that we made: we can interpret solving a linear system, so for us, A is always a kernel matrix, equivalently as minimizing a quadratic function f, where A is in the quadratic form. And if you take the gradient of this with respect to x and set it to zero, so solve this convex optimization problem, that is equivalent to solving the linear system. So that's an equivalent perspective on the same problem that we're trying to solve to get the solution of the linear system of the kernel matrix and the labels. We introduced a different name for the gradient, because in this setting you can also view it as a residual: if I plug in some estimate of the solution x here, the right-hand side minus Ax is a measurement of how close I am to the true solution. And the idea of conjugate gradients now is, well, I could do gradient descent on this problem, maybe the simplest optimization algorithm that you know. And that would be fine: I would get the green line here and take a bunch of steps through this optimization landscape, where the blue lines are the equipotential lines. But it turns out I can do something smarter. I can take into account the geometry of this problem.
What that means is, I notice that this is in a way a space that is stretched, and the stretching directions are given, you can see this, by the equipotential lines. They're like concentric circles that are stretched. And if I take that into account, then instead of walking orthogonally with respect to the Euclidean inner product, I can walk orthogonally with respect to the inner product that's induced by the matrix A. And what that looks like in x space is I take the red steps here. So I only take two steps, and this angle here would be a right angle if I squished my blue lines such that they looked like circles. So it's a right-angle step in a different geometry. And it turns out that these kinds of methods all converge in at most n steps, the size of the problem. And in particular, if I choose the residual as an initial step, you can see that gradient descent and conjugate gradients actually take the same first step, right? Green and red are the same. These methods actually converge typically in a very small number of steps relative to the size of the problem. So here again is the conjugate gradient method as an algorithm, and the most important thing we saw is that we recognize a lot of elements that we also saw in the partial Cholesky. So there was a choice of action, in this case the residual. There was an estimate of the inverse, C_i, a low-rank matrix constructed from the same vectors, or similar vectors. And additionally, of course, now that we're interested in a solution to the linear system, not just an inverse approximation, there's also the solution estimate that gets built up from these search directions, so directions in input space that I follow when I walk. Okay, so very similar elements to the partial Cholesky inverse approximation, with one difference: here the action is the residual instead of a unit vector.
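The method just described can be sketched in a few lines. This is a textbook-style conjugate gradient loop, not GPyTorch's internal implementation: the first step equals the gradient descent step (the residual), later directions are made A-conjugate, and each iteration costs one matrix-vector multiply.

```python
import numpy as np

def conjugate_gradients(A, b, num_iters):
    """Solve A x = b for symmetric positive definite A."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual = negative gradient of the quadratic
    d = r.copy()           # first search direction = steepest descent step
    for _ in range(num_iters):
        Ad = A @ d                         # the one matrix-vector multiply per iteration
        alpha = (r @ r) / (d @ Ad)         # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Ad
        if np.linalg.norm(r_new) < 1e-10:  # converged early
            return x
        beta = (r_new @ r_new) / (r @ r)   # makes the next direction A-conjugate
        d = r_new + beta * d
        r = r_new
    return x
```

In exact arithmetic this converges in at most n steps; in practice, as noted above, far fewer iterations usually suffice.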
Okay, and then at the very end, we asked: partial Cholesky, conjugate gradients, if I do a rough complexity analysis, what's the most expensive operation in both of these algorithms, in one iteration? Any ideas? What type of operation is the most expensive? It's a matrix-vector multiply with a matrix of size n by n, no? Okay, so the most expensive operation in each iteration of both algorithms is a matrix-vector multiply with this matrix A of size n by n. So we look at these and could say, okay, they're both quadratically expensive. Can we do this cheaper? Can we find an even faster approximation to a Gaussian process? And at the very end, I very quickly teased that there are linear-time GP approximations, most of them based on inducing points, so summary data points of the training data set. And typically, you optimize where these summary data points, in gray, actually lie. So from the bottom to the top, we kind of optimize the locations of data points that summarize our data set. And then I can approximate my blue mathematically exact Gaussian process, which I would only have access to with O(n cubed) operations, with this orange Gaussian process approximation, the so-called sparse approximation. What was the big issue that we pointed out at the very end with this plot? Would you be satisfied with the orange approximation for the blue one? People are shaking their heads, right? So one problem with this is, of course, we can't expect this to be exact, because we're only investing a certain amount of computation, but the uncertainty is also very miscalibrated, right? In regions where we know nothing about the data set, we are not as uncertain as we should be based on the kernel.
And in regions where we have a lot of data, for example here, it's a little hard to see, but there's essentially no uncertainty at this point, even though the data and also the exact posterior mean are far away, right? So this linear-time approximation tells me you should really trust your prediction here, but actually it's wrong; and you should somewhat trust your prediction here, but in actuality, we should be even more uncertain about where the true function lies, okay? So can we fix this? Is there a way to get around this problem? Because this is the reason to use Gaussian processes in the first place, right? One of the big reasons. One is we can really nicely encode what we know about the function we're trying to learn via the kernel, and the second is we get an uncertainty estimate, so we know how much to trust our prediction given our model. And this second point is violated in this illustrative example, right? I shouldn't really trust the uncertainty quantification there. So can we do this? And we'll see how to do this today. We'll see in what sense we can actually get exact uncertainty quantification no matter how much computation we have available. And we'll dub this computation-aware GP inference. Before I get into it, questions about the recap? Okay. All right, so quick disclaimer. I said last time you saw the state of the art; what you're seeing today is in a way the future, okay? All of what I'm going to present today is going to be presented at NeurIPS later this year. So this is as cutting edge as it could be, because this research paper is not even published yet, technically, okay? So take it with a grain of salt, because this is my work and Philipp's work, in collaboration with Geoff Pleiss, Marvin, and John Cunningham, and Geoff is the author and principal maintainer of GPyTorch. Okay. Let's jump back to the very beginning of this talk. We took a look at the Gaussian process posterior, and we always focused on the fact that there is this matrix inverse here.
But really, let's take another look at this, and let me rewrite this inverse times the labels minus the prior prediction as just a vector, okay? This is just a vector, right? It's just the solution of a linear system, which is a vector, and I'll give it a name: the representer weights. And the reason why we give it this name is, let's take a look at what it actually does. The evaluation of this kernel is a bunch of functions, k(x, ·), that are centered at the training data. It's evaluated at all the training data points in the right argument, and on the left, there's an arbitrary x that I evaluate. So one way to rewrite this, just writing out the inner product, is as a bunch of kernel functions centered at data points x_j, that's in the plot the gray vertical lines here, and at each of them, there is, in red, a kernel function that has the same shape, you can see all of these red shapes look the same, that is weighted by the corresponding entry of the representer weights, okay? So if you think of the kernel functions as being representative of the shape of the true function, that motivates the name: I weight these representations of the true function in a certain way, and then I sum them up to get my prediction, okay? So the posterior mean is a linear combination of these representative functions defined by the kernel. And here it's also very clear that the kernel encodes assumptions about what the true function looks like, right? If the shapes that I defined via my kernel here look very similar to the function, I can trust my GP to model it very well. Now, what happens if I use an iterative method to approximate these representer weights? Because that's what we always talked about, right? We talked about solving this linear system with the kernel matrix and the labels, or labels minus prior prediction, with an iterative solver.
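A small sketch of this view, with a zero prior mean and illustrative data I made up for the example: the representer weights v_star solve one linear system with K hat = K + sigma^2 I, and the posterior mean at any x is then just a weighted sum of kernel functions centered at the training points.

```python
import numpy as np

def rbf(a, b):
    """Squared-exponential kernel between two 1-D input arrays."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

# illustrative training data and noise level (assumed values)
X = np.array([-2.0, -0.5, 1.0, 2.5])
y = np.sin(X)
sigma2 = 0.1

K_hat = rbf(X, X) + sigma2 * np.eye(len(X))   # K + sigma^2 I
v_star = np.linalg.solve(K_hat, y)            # representer weights (zero prior mean)

def posterior_mean(x):
    # weighted sum of kernel functions centered at the training points
    return rbf(x, X) @ v_star
```

The expensive step is the single `solve`; everything downstream of it is cheap, which is exactly why approximating v_star is where the error enters.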
So we get some kind of estimate of this linear system solve, call it v_i. So we don't really get the exact mathematical posterior mean, and I call it mathematical because there's an operation in here, a mathematical operation, that we would have to carry out exactly. We only get an approximation v_i of these representer weights v_star. And what does it look like in this schematic illustration of the sum of kernel functions? So I don't get the true red dashed weighted kernel functions; if you go a slide back, the red lines are now dashed because I don't have access to them, I have to approximate them. And instead I get these solid red approximations of them. And you can see that here, for example, this one is pretty well-scaled, right? It's almost equal to the dashed line. Same for the outer data point here. But for these central weighted kernel functions, you can see that the approximation is so bad that it's not even close to the true kernel function centered at this data point, weighted by roughly minus one, okay? So in this view, we can think of these iterative solvers as approximating the weights for the kernel functions that construct our prediction. Now, what if we found a way to quantify this approximation error between the true representer weights and our approximation? And ideally, we could do this probabilistically, because we're dealing with a probabilistic model, so maybe there's hope that we can add this additional uncertainty, coming from estimating these weights, to our prediction, to our posterior, okay? So that's the idea. Estimate these weights, but keep track of the approximation error, the distance to the true representer weights, and add that to our uncertainty. All right, quick interlude recap for those of you that maybe haven't had that much contact with the linear algebra of Gaussians, or the linear algebra of inference, as we like to call it.
So Gaussian distributions have a ton of nice properties that make them really easy to work with in probability theory, specifically when it comes to conditioning on observations. And we've seen a little bit of that with Gaussian processes, but we'll recap one important property that we'll need to derive our algorithm. So first of all, some simple ones. If I have a Gaussian distribution over a random vector z, and I take a matrix-vector multiply of that random vector, the result is a random vector that still has a Gaussian distribution, and I can actually directly give its parameters: the matrix times the mean, and the covariance matrix wrapped from the left and right with the matrix. We can think of this in the very simple case of a scaling, right? If you just scale your random vector up, of course your mean prediction also gets larger, but the variance of your random variable increases as well. And the second important property is marginalization. If I have a joint Gaussian random variable, so x and y are jointly Gaussian, I can marginalize away, in this case, the random variable y, and I can just take a look at the entries of this vector, essentially, and read off the parameters of the Gaussian to find out what the distribution of the random vector x is. And I can combine these properties to get a nice result that you can also use to prove what the Gaussian process posterior is from a weight-space perspective. Suppose I have a prior over a variable x here, with prior mean and covariance mu and Sigma, and I have an observation model, so I make observations of this random variable through a model, y here, where y is an affine map of x, okay? So matrix times vector plus some bias. Then, if that's the case, I can simply write down the posterior distribution over this x given these observations y. And let's take a quick look at what elements pop up here, because you might recognize some of them from GP inference. So here in the very front, mu is my prior mean, right?
And let's think of the matrix B as just the identity for now. So I have my prior prediction, and back here I have my observations y, and I subtract from them my prior prediction, A times mu, because remember, y is x observed through A. And then I scale this observation, how much I should update, how surprised I should be, by my prior uncertainty Sigma and my observation noise Lambda; and the prior uncertainty I have to multiply from the left and right with A, because that's how I observe y, right? That's this rule back here. So you should recognize this a little bit from just Gaussian inference, okay? But we'll now use it in a different fashion. So let's take one step back. Our goal is: can we quantify the approximation error in our estimate of the representer weights? And we want to do so ideally probabilistically. So what does that mean? In this graphic here, the red dot is our estimate of the representer weights, v_i, after i iterations, and in black are the true representer weights. And this is a 2D example, so this is actually the very initial point. If I have a Gaussian distribution over where the true representer weights v_star are, centered at this red dot, a belief about where v_star is, what that looks like is: I have a mean estimate v_i, that's the red dot, and then I have these concentric ellipses that represent my Gaussian bell curve in 2D. So the closer I am to this dot, the more intense the colors, the more certain I am that the true representer weights lie in this direction. So you can see that this is a way to quantify error, right? The spread of this Gaussian is how far away I think the true representer weights are, and then its orientation and squishing give me kind of an idea in what direction I should look for the true representer weights. It's less likely that they're up towards the right, and more likely that they're towards the top left or the bottom right.
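The affine Gaussian inference rule just recapped can be written down directly. This is a generic sketch of the theorem, with illustrative names: a prior x ~ N(mu, Sigma), an observation y = A x + b + noise with noise covariance Lambda, and the resulting posterior over x.

```python
import numpy as np

def affine_gaussian_conditioning(mu, Sigma, A, b, Lam, y):
    """Posterior over x ~ N(mu, Sigma) given y = A x + b + eps, eps ~ N(0, Lam)."""
    gram = A @ Sigma @ A.T + Lam                 # predictive covariance of y
    gain = Sigma @ A.T @ np.linalg.inv(gram)     # how much to trust the observation
    mu_post = mu + gain @ (y - (A @ mu + b))     # correct the mean by the surprise
    Sigma_post = Sigma - gain @ A @ Sigma        # uncertainty shrinks along observed directions
    return mu_post, Sigma_post
```

For example, with A the identity, b zero, and unit prior and noise covariances, observing y = (2, 2) moves the mean exactly halfway, to (1, 1), and halves the covariance, matching the "scale by how surprised I should be" intuition above.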
Okay, so let's say we had a prior like this, just a Gaussian distribution that somehow quantifies how far away I am from the representer weights. How can I update it? How can I find out more about where the true representer weights v_star are? I might just start with a guess of zero, but I want to figure out how to learn these representer weights. So what can I do? I can't directly observe the representer weights v_star, right? To get them, I would have to solve this linear system, this inverse. And Philipp, maybe you could help me for a second with a pointer. So I can't directly observe this black dot here, but what I can do is indirectly observe it, okay? And I can do that in the following fashion. Remember that the residual, we talked about that earlier with CG, is the labels minus the prior prediction, minus the kernel matrix times my estimate for the representer weights, okay? So this is the residual. And then let's say we have some vector s_i; it doesn't matter what it is for now, just some vector. Then I can pull out this kernel matrix K hat here to get s_i transpose times K hat, that's just another vector, times the difference between the true representer weights and my guess for them. Where have you seen some linear map times the quantity that we care about, that we have a Gaussian distribution over, just a few slides ago? This is just an estimation problem, right? This is just Gaussian inference, and we observe through a linear map. So we can apply our nice theorem that we just saw, just a few slides ago, namely down here, right? We have the quantity we care about, in our case it's v_star, and we have a linear projection of it through some matrix, so we can update our belief, our Gaussian belief, over these representer weights. So what does it look like geometrically?
So I have my representer weights estimate v_i, in red here, and I pick some direction induced by the vector s_i, the action again, times the kernel matrix, and that gives me a direction in this space of the representer weights, the red line. And now, when I make my observation, what that corresponds to is projecting the difference between the guess and the true representer weights onto this line. So I get this distance here between the two on that line. And as we said, that's just affine Gaussian inference, right? That's just the theorem that we saw two slides ago. So we can apply this nice affine Gaussian inference theorem directly to update our belief about where the true representer weights are. So what happens now is, from this initial Gaussian ball, we actually observe without noise, right? The observation is exact. There's no noise in this observation alpha_i; we get this matrix-vector product exactly. So our belief collapses onto a line, namely along the direction in which we haven't actually observed anything. We only observe the distance in the direction we looked in; we don't know how far away we are in the orthogonal direction. So our belief collapses onto the line orthogonal to the direction we looked in, and we get a new belief that's one dimension less, in this case, one-dimensional. And this orange here is the Gaussian bell curve in 1D that lives on this line. So my estimate is that the true representer weights are there at the red dot, and I think they're within roughly this region, maybe that's one or two standard deviations, okay? So now I have an actual learning algorithm. It's just a probabilistic belief update, right? That's a probabilistic learning algorithm to find these representer weights. And I can do the same update again now by choosing a new search direction. Before we look at the form that this takes, observe something. Is this red line, in the way it's oriented, the best choice here?
Do you think there's a choice that's more informative? Remember, what I get is the distance between the red dot and the projected black dot. The projection is given by this light red line. So I observe this distance, and then I update and move towards this intersection point. What would be a better choice, do you think? Can you draw a red line here that would be more informative about where the true representer weights are within one step? So remember, we move from this point to the projected point in this one-dimensional subspace. Yeah. The line that goes through the actual representer weights. Right, that would be a direct step to the solution. But anything that's more aligned with this ellipse, with our prior belief here, would be a better choice. So in this case, that's roughly this direction, but it could be that our prior belief looks different; it could be slightly rotated. But you're right, the best step obviously would be to step directly to the solution; that, however, is probably just as costly as solving the system itself. Okay. So let's take a look at what the posterior after the update actually looks like. So now we have an algorithm that updates a belief about the representer weights with an observation that we can make, just an action times a residual, to get an updated belief about where the representer weights are. And this updated belief you can directly derive through this affine Gaussian inference. And we'll look at the elements that we saw before again here. So v_{i-1} is my current guess for the representer weights, my prior mean; and here's an update term. That's the difference between the quantity that I care about and the quantity that I estimate it to be, scaled by how certain I should be, or how surprising this difference should be to me, right?
If I'm very, very uncertain at the beginning and I see a large difference, maybe I shouldn't be that surprised that I don't know where the true solution is. If I'm very concentrated, I think I really know where the representer weights are, and suddenly the true representer weights are far away, I should make a big update to actually get to the true representer weights. But that's not the entire story, right? Because we have this linear map in between, s_i transpose K hat. In the linear Gaussian inference theorem, that's this matrix A here, the projection of our representer weights. So we only see a projection of it. We only see where on this red line the black dot is. Okay. And now I gave names to these quantities, and we should start to recognize some of these from the previous lectures. So here, that's just the projected residual. We had that up here, right? s_i transpose r_{i-1} is s_i transpose K hat times the difference between the representer weights. That's my observation. I can compute that, great. Now, what's in the middle here, is that a scalar or a matrix? What dimension does the thing that I invert have? It's a scalar, right? Because I multiply from the left and right with two vectors. So that's also easy to invert, great. And then the final thing here is my previous uncertainty times K hat s_i, so the map that I project through. Okay. And I gave all of these quantities names. I gave this quantity the name d_i, and it pops up again in the update to how certain I am about my representer weights. My previous uncertainty is reduced by a quantity that depends on what I already know about my representer weights in the direction that I'm looking, via these same vectors: d_i d_i transpose. So that's an outer product of vectors again, that's a rank-one update, and it has the same form that we've seen for CG and for the partial Cholesky, okay? Remember, that's also an outer product of vectors d_i.
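The pieces just named, the projected residual alpha_i, the direction d_i, the scalar normalizer eta_i, and the rank-one updates, can be collected into a short solver loop. This is my own illustrative reconstruction, assuming a zero prior mean for v_star and writing the belief covariance implicitly as K hat inverse minus C_i, so that everything below is computable without ever forming K hat inverse:

```python
import numpy as np

def probabilistic_linear_solver(K_hat, rhs, actions):
    """Belief update for the representer weights v* = K_hat^{-1} rhs.

    Starting from the prior v* ~ N(0, K_hat^{-1}), each action s_i yields the
    noise-free observation alpha_i = s_i^T r_{i-1}; conditioning gives a
    rank-one update to the mean v_i and the inverse approximation C_i.
    """
    n = K_hat.shape[0]
    v = np.zeros(n)                    # belief mean: current representer-weight estimate
    C = np.zeros((n, n))               # C_i -> K_hat^{-1}; belief covariance is K_hat^{-1} - C_i
    for s in actions:
        r = rhs - K_hat @ v            # residual
        alpha = s @ r                  # projected residual: the observation
        d = s - C @ (K_hat @ s)        # search direction d_i
        eta = d @ (K_hat @ s)          # scalar normalization (easy to "invert")
        v = v + d * (alpha / eta)      # rank-one mean update
        C = C + np.outer(d, d) / eta   # rank-one update to the inverse approximation
    return v, C
```

With unit-vector actions this recovers the partial Cholesky behavior from earlier; with residual actions it behaves like CG, which is exactly the unification the lecture is building towards.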
And down here, we just rewrite it by collecting these actions s_i in the columns of a matrix. So this matrix capital S_i is just all the actions collected into columns. So now I have a way to learn the representer weights. I have an actual learning algorithm, and it seems very, very similar to CG and to the partial Cholesky. But what have I not told you yet? I haven't told you yet how we should choose the prior, right? How do we get this ellipse? How do we get this error estimate? We have to start somewhere with our belief, right? And this seems like a hard problem, because it has to be related to where the true representer weights are. Otherwise, it's not really a great error estimate, right? So at the very beginning, before I observe anything, I already kind of have to know where the true representer weights lie to get a good ellipse here. That seems kind of backwards, right? So how the heck do I choose this prior? Well, first, let's think about what problem we're actually trying to solve. We're not trying to solve a linear system; we're trying to do GP regression. We're trying to learn representer weights. So we have a Gaussian process prior already, and the representer weights are certainly related to the assumptions that we make, because they involve the inverse of the kernel matrix. So they're related to these prior assumptions through the GP prior. Specifically, our observation model for the GP is: we have some latent function f that we're trying to learn, that we place a prior over, plus some noise epsilon that's independent. So if I make a vector-valued observation, a bunch of labels y1, y2, and so on, I assume that to be generated from the true function evaluated at the training data, plus some independent noise. Now, before I make any observations with my GP, how do I think the data was generated? I can use my prior to try to understand what distribution the data was generated from.
And according to the prior, if I assume I have a distribution over f, namely the Gaussian process prior, what I get is that the labels are distributed with zero mean, if I assume a zero-mean Gaussian prior, and they vary according to the kernel matrix plus this independent noise, because that's part of my observation model. So a priori, before observing any data, this is what I think the data looks like. Now, the representer weights, we said, are just a reparameterization of the data, right? They're just the inverse of the kernel matrix times the data. So before observing anything, what I believe the representer weights look like is just K hat inverse times this distribution over y, my prior predictive distribution for the labels. So if I multiply with K hat inverse from the left, remember from the rules of Gaussians, I can just multiply the covariance from the left and right with K hat inverse, up here. What I get is a Gaussian with zero mean and K hat inverse as the covariance. So now I have my prior for my representer weights, right? Because that's what I wanted: before I observe anything, I have a belief over what v_star is. But where's the problem? Yeah, it involves the thing that we're trying to compute, right? So did we just go in circles? Did I just waste 10, 15 minutes of your time? To get an algorithm that updates a belief to find the representer weights, that actually quantifies its own error, that has this nice ellipse here elongated in the direction of the true representer weights, I actually need the thing that I'm trying to compute. I need to invert the kernel matrix. So what do we do? Well, again, we're not trying to estimate the representer weights only; we're trying to do GP regression. So let's take, again, a look at the problem that we're trying to solve.
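The derivation of this prior is just the linear-map rule for Gaussians applied once; a tiny numerical check of the identity, with a random SPD stand-in for K hat, assuming a zero prior mean throughout:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
K_hat = B @ B.T + 4 * np.eye(4)        # stand-in for K + sigma^2 I
K_inv = np.linalg.inv(K_hat)

# Prior predictive of the labels: y ~ N(0, K_hat).
# The linear map v* = K_hat^{-1} y wraps the covariance from left and right:
#   Cov(v*) = K_hat^{-1} K_hat K_hat^{-T} = K_hat^{-1}.
cov_v = K_inv @ K_hat @ K_inv.T
assert np.allclose(cov_v, K_inv)
```

So the calibrated prior over the representer weights is N(0, K hat inverse), which is exactly where the circularity pointed out above comes from.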
So we have a Gaussian process. Our posterior mean evaluated at a new data set, let's call it X diamond, is just the prior mean plus these kernel functions centered at the data points times the representer weights. We've seen that a bunch of times now. Similarly, for the posterior covariance function, I have my prior kernel minus this update that's given by K hat inverse. Now, we also just found a way to learn representer weights: a probabilistic learning algorithm that, given a prior over the representer weights, updates this prior by projecting onto actions, and after i iterations I get a Gaussian belief over the representer weights. This Gaussian belief has a mean estimate of the representer weights, given by this inverse approximation times the labels minus the prior predictive mean, and uncertainty represented by this covariance matrix, which is our error estimate. And this covariance matrix gets updated by the inverse approximation. So here we can see that C_i really is an inverse approximation, right? Because the more I learn about v star, the closer C_i gets to K hat inverse. But how do we solve this problem of needing K hat inverse as our calibrated prior? Well, now that we've quantified the uncertainty that comes from approximation, what happens if we try to push it through to the posterior? And what I mean by that is this function, mu star, the posterior mean, is a function of v star. It's a linear function of v star. And we have a Gaussian belief over v star. So what we can do, first of all, is compute a distribution over mu star evaluated at some data point, right? Because it's just a linear map of a Gaussian; that's one of the properties of Gaussians. But what we can also do is marginalize our belief. We can consider all possible representer weights according to our belief.
So picture dots all around these ellipses, more of them in the middle and fewer on the outside. I can consider all of these representer weight estimates and compute a posterior mean prediction that takes all of them into account. And what happens if I do that, if I marginalize? For the mean, linear marginalization just means I replace the variable v star with my mean estimate for the Gaussian random variable. That's just marginalization. And for the covariance, I get additional uncertainty coming from this marginalization, because there are multiple options for v star, right? I need to consider all of them, so I should be more uncertain about what the true value of mu star is. So I get an additional term here, in addition to the mathematical uncertainty from up here: the computational uncertainty. The random variable that I care about is k(dot, X) times v star, a linear map of v star. And to get the covariance of k evaluated at some point, dot comma X, times v star, I just multiply the covariance matrix from the left and right with that linear map. That was, if we jump back a little bit, this rule here, right? If z, in our case v star, has some covariance and I look at Az, in our case k(dot, X) times v star, I get my covariance matrix multiplied from the left and right by this matrix. Let's jump back. That's precisely what we see down here: Sigma_i, our uncertainty about the representer weights, multiplied from the left and right with the linear map in front. And now something happens, right? Because Sigma_i actually is K hat inverse minus C_i. So the K hat inverses here and here just cancel, because here there's a minus sign in front. What I get is what we call the combined uncertainty, which includes the error to the true function, and I'll make that precise in a moment, and the error in the representer weights, in one single estimate.
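This cancellation is easy to verify numerically. The sketch below (toy kernel and data, with a batch-form inverse approximation C built from a few unit-vector actions, a partial-Cholesky-style policy) checks that mathematical plus computational uncertainty equals the combined uncertainty, and that the combined form no longer contains K hat inverse at all.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 3
X = rng.uniform(-2, 2, n)
x = rng.uniform(-2, 2, 5)                     # test points
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)
K_hat = k(X, X) + 0.1 * np.eye(n)
K_hat_inv = np.linalg.inv(K_hat)

# Low-rank inverse approximation from m actions (batch form; here
# unit-vector actions, i.e. a partial-Cholesky-style policy).
S = np.eye(n)[:, :m]
C = S @ np.linalg.inv(S.T @ K_hat @ S) @ S.T

kxX = k(x, X)
mathematical  = k(x, x) - kxX @ K_hat_inv @ kxX.T   # needs K_hat^{-1}
computational = kxX @ (K_hat_inv - C) @ kxX.T       # Sigma_i = K_hat^{-1} - C
combined      = k(x, x) - kxX @ C @ kxX.T           # K_hat^{-1} cancels!
assert np.allclose(mathematical + computational, combined)
```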
And the key point, the nice thing, is: I can compute C_i without K hat inverse. I just need these vectors d_i, which involve Sigma_{i-1}. But remember, Sigma_{i-1} is Sigma_0, which is K hat inverse, minus some update terms, so the K hat inverse cancels with K hat here as well. We don't actually need it to compute our precision matrix estimate. And even if you don't fully understand the derivation, it's really as simple as replacing K hat inverse with the learned inverse approximation, right? It's a one-to-one correspondence here. And similarly, I replace the representer weights with my estimate of the representer weights. Now let's look at why this is really well motivated. Clearly, marginalization results in this expression, but let's look at an interpretation of these terms. Remember, the covariance of a random process f, in our case a Gaussian process, evaluated at some point x with itself, is the expected squared error between the random process evaluated at x and its expectation; for us, that's f(x) minus mu(x), where mu(x) is our prior mean, okay? So the posterior covariance has an interpretation as the expected squared error between the true function f that I'm trying to learn, which for us is a random object that I have a distribution over, and my posterior mean prediction mu star. That's why we call it mathematical uncertainty: there's a mathematical operation to arrive at it, but it might be too expensive in practice. So in this sense, the uncertainty of my GP is an error estimate, a mean error estimate, right? Because it's an expected value. In the same way, this term here, the computational uncertainty, let's call it k_i comp, is also an error estimate, because we said that Sigma_i is the covariance of our belief over the representer weights.
So it's my belief about the representer weights, how certain I am about the true value of the representer weights. It's the expected squared error between the true representer weights and my estimate of them, the mean. If I now consider not v star, but k(x, X) times v star, the expected squared error I get is between k(x, X) times v star, which is just the exact mathematical posterior mean, and the same expression with v star replaced by v_i, which is the approximate posterior mean. So this is the error between the thing that I'm trying to compute, the exact posterior mean, and my estimate of it. So if I combine these two, I consider the error that comes from not having observed enough data to learn the true function, plus the error that comes from not having done enough computation to actually work out what my prediction should be for the data that I have. Okay, and I can combine them up here into one combined uncertainty that I can actually compute. What does that look like? Let's look at Cholesky actions, so unit vector actions. Here's a plot where, in black dashed lines, I have the true latent function, which is a sine here. I have the data in black dots, the mathematical posterior mean, so the object that I'm trying to learn if I had cubic amounts of computation, and my approximation of it in blue. So the blue line is mu_i. And then I have the mathematical and computational uncertainty in this plot. At the top, I show two standard deviations, and at the bottom, I show the variance. So at the top, I would expect that the true function that I'm trying to learn, black dashed, always lies within both shaded regions, so within the sum of the shaded regions. And what happens if I now run my algorithm is that, at the bottom, I reduce my uncertainty about where the true function lies.
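In symbols, the interpretation just described reads roughly as follows (notation as in the lecture: mu star and v star are the exact posterior mean and representer weights, v_i the estimate after i iterations, Sigma_i the belief covariance, C_i the inverse approximation):

```latex
% Mathematical uncertainty: expected squared distance to the latent function
k_*(x,x) \;=\; \mathbb{E}\!\left[\big(f(x)-\mu_*(x)\big)^2\right]

% Computational uncertainty: expected squared distance to the exact posterior mean
k_i^{\mathrm{comp}}(x,x) \;=\; \mathbb{E}\!\left[\big(k(x,X)v_* - k(x,X)v_i\big)^2\right]
\;=\; k(x,X)\,\Sigma_i\,k(X,x)

% Combined uncertainty: both sources of error in one computable quantity
k_i(x,x) \;=\; k_*(x,x) + k_i^{\mathrm{comp}}(x,x) \;=\; k(x,x) - k(x,X)\,C_i\,k(X,x)
```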
And the blue line, my posterior mean approximation, gets closer to the black line, the mathematical posterior mean. Okay, so that was Cholesky for five iterations. Observe how the black dashed line always lies within this two-sigma region with high probability. And observe at the bottom how my computational uncertainty gets carved out, right? The more I compute, the less computationally uncertain I am. But I always remain mathematically uncertain, because obviously I've only observed a finite amount of data, so I can't be fully sure what the true function is. But I can reduce this green uncertainty, which comes from not having computed enough, by computing more. Let's take another example: CG actions, so residual actions. Before I do any computations, it's the same situation, right? Posterior mean approximation at the top, variance at the bottom. I do one iteration, and that looks a little different. Let's do two iterations. Already after two iterations, my approximate posterior mean in blue is much closer to the true posterior mean than we saw for the Cholesky. And my uncertainty reduction is much more global, right? Previously, if we jump back a little bit, it was very local, centered at the selected data points. With CG, it's much more global, even in the first iteration. And that's because we actually target computation at all data points: we look at the residual, we don't only look at a single data point. The posterior mean also converges very quickly to the true posterior mean. But that doesn't necessarily mean I've already learned everything about the true function, right? After five iterations, my computational uncertainty is very much reduced almost everywhere, the approximation is very close to the true posterior mean, and the true function that I'm trying to learn lies nicely within these two regions. Okay, now let's take a look at the algorithm.
So essentially, this method we proposed, called IterGP, really just wraps this learning algorithm for the representer weights into one that computes the GP posterior. We'll go through the steps one by one. It looks quite complicated, but it's actually very simple, and you've seen it in different forms already, for the Cholesky and for CG. We start with the belief over the representer weights, where we initialize, for example, our estimate of the representer weights to zero, and our precision matrix approximation also to zero, because we don't know what the kernel matrix inverse actually is. And we don't actually instantiate the belief; we just virtually track the uncertainty about the representer weights through C_i, okay? So we don't compute K hat inverse, we only track the update, because we never actually need to invert the kernel matrix to use the belief in the way that we use it. Then we choose a line here, an action, in which to look, and that is s_i transpose times K hat, to observe how close we are to the true representer weights in the subspace that we've chosen. And here you can see that the policy really makes a difference for how quickly we converge to the true representer weights, as we've seen for CG, right? Because I can choose very informative actions in this direction, and I can choose very uninformative actions in this direction. If I projected the true representer weights v star along this line here, they would lie directly on my approximation, so I wouldn't really learn anything about the representer weights in that step. But I do learn something about the inverse, just not in the subspace that y lies in. Okay, then we observe our projected residual: the distance between v star and v_{i-1} along this line, so the distance between the red dot and this shaded intersection.
And I compute the direction in which I have to walk to update my representer weights estimate. Then I also compute this low-rank approximation. And in memory, I don't actually form this, because if I have a low-rank matrix, as we've learned in the second lecture and in the first exercise, I just track the low-rank factors individually, plus maybe some diagonal scaling. And here's an illustration: on the left is the true precision matrix. I think that's actually the example of CG that we saw before. And after a few iterations, CG has learned the precision matrix approximation. You can see at the bottom here, for example, that it's learned some structure on the diagonal as well. But there are some regions, for example here, where we don't really know what the true uncertainty should be; there's still computational uncertainty left about it. Okay. So now we have an algorithm, and you recognize all of these quantities, right? The action, the observation: it's the same language that we used last time. And then finally, of course, we update our estimate of the representer weights and we update our precision matrix approximation. Do the partial Cholesky and CG actually fit into this framework? Of course they do, otherwise I wouldn't have shown it to you. We can choose unit vector actions to recover the partial Cholesky, also called subset of data. We can select possibly preconditioned residual actions to recover preconditioned CG. There's actually a second way to recover that, by choosing already conjugate actions instead of just residuals, because residual actions automatically get made conjugate. And finally, we can also design a method that looks very similar to the linear-time method we saw previously, by choosing actions that are kernel functions, so actions that weight data points according to a kernel function centered at an inducing point.
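The steps just described can be sketched in a few lines. This is a simplified toy version (dense C_i for clarity, no low-rank bookkeeping, no reorthogonalization, zero prior mean), not the actual IterGP implementation; the policy is passed in as a list of action vectors, and the check at the end uses made-up data with unit-vector actions, i.e. the partial-Cholesky policy.

```python
import numpy as np

def itergp_solve(K_hat, y, actions):
    """Sketch of the IterGP-style inner loop over policy actions s_i."""
    n = len(y)
    v = np.zeros(n)                      # estimate of the representer weights
    C = np.zeros((n, n))                 # inverse approximation C_i
    for s in actions:
        alpha = s @ (y - K_hat @ v)      # observation: projected residual
        d = s - C @ (K_hat @ s)          # search direction, Sigma_{i-1} K_hat s_i
        eta = s @ (K_hat @ d)            # normalization constant
        v = v + d * (alpha / eta)        # update representer-weights estimate
        C = C + np.outer(d, d) / eta     # rank-1 update of the inverse approx.
    return v, C

# Toy check with a made-up SPD "kernel" matrix and unit-vector actions:
# after n steps we recover the exact solve and the exact inverse.
rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
K_hat = A @ A.T + n * np.eye(n)
y = rng.standard_normal(n)
v, C = itergp_solve(K_hat, y, list(np.eye(n)))
assert np.allclose(v, np.linalg.solve(K_hat, y))
assert np.allclose(C, np.linalg.inv(K_hat))
```

Note that K hat only ever appears inside matrix-vector products, which is what makes the cost per iteration controllable by the policy.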
So let's go through this two-dimensional view of what happens if I choose certain actions, and what the interpretation of choosing an action is. If, for Cholesky, I choose unit vector actions, what I'm doing is selecting data points, shown here in red, one per iteration. And around these data points, you can see that the computational uncertainty in green is reduced. So where I target my computation, I more quickly learn about the thing that I actually care about, down to the mathematical uncertainty. And if I sum these two up, this is the object that I can actually compute. So for IterGP-Cholesky, where I target my computation, I reduce combined uncertainty. Now for CG, I target globally; I just target some data points more and some less. Okay, I use an action vector whose entries correspond to how much I target my computation. And you can see that in regions where I target more, where the data points are more red, corresponding to the magnitude of the entry in the action vector, I get much more of a reduction in combined uncertainty, versus regions where I don't target my computation, where computational uncertainty remains high. Okay, and then finally this SVGP-like method: I center kernel functions at inducing points and weigh the data points in the input space according to what the kernel function says, so the magnitude is given by the kernel, and I reduce my uncertainty near these placed kernel functions. And that results in a method that kind of fixes the problems of SVGP. On the left here, you have an SVGP posterior, and you can very clearly see, once I get my mouse back, at the top that in some regions the uncertainty is very small, while the red approximate posterior mean is very far away from the black line, the true posterior mean.
In fact, we can measure that error in the mean and in the variance in the bottom panels: sometimes we underestimate the mean, sometimes we overestimate it, and similarly for the variance. Sometimes we're underconfident compared to the mathematical posterior, and sometimes we're overconfident. Now, if you choose these kernel vector actions, what we get on the right-hand side is IterGP. You can see, first of all, if you look at the top plot, that the exact posterior mean in black lies within the green region of computational uncertainty. That's where I think the true posterior mean could lie, relative to the approximate posterior mean. And if you compute the variance error, so how over- or underconfident our method is, it's always less confident if we haven't computed enough, exactly as it should be: if we don't know exactly where the true posterior mean is, we should add additional error to our estimate. But these kernel vector actions are generally dense, so our complexity is O(n squared) per iteration. As we said previously, and one of you pointed out, the most expensive operation is the matrix-vector product, so we pay O(n squared) for this, right? So is that great, then? We fixed the problem, but we paid a lot more for it. Is that necessary? Let's take another look at what the actual cost of this algorithm is. The three points where the kernel matrix appears are: in the observation, so the computation of the action times the residual; in the search direction, where we take the action and correct it, similar to the conjugate correction in CG; and in the normalization constant that ensures we walk only as far as we should in this quadratic. Let's look at the first one. What is v_{i-1} actually made up of? Well, v_{i-1} is just a sum over all the search directions, as it is for CG, right? The steps that I've taken. It's just a linear combination of these steps.
So really, all the terms that appear in there are of the form: action times kernel matrix times the steps that I've taken. Similarly for the search directions: my inverse approximation C_{i-1} is an outer product of these search directions d_j. So all the terms in here that involve K hat are also of this form, s_i transpose times K hat times a search direction vector, from all the previous iterations or the current one. And clearly, the normalization constant also has this form; it's just defined this way. So what if we chose actions that only target part of the data points, as we do for the Cholesky? What if we chose action vectors s_i that have a bunch of non-zero entries for some data points, one, two, and so on, and then a bunch of zeros, and all the actions that we choose only ever have non-zero entries up here? Well, what happens is that if we multiply from the left with s_i transpose, we clearly only need to look at the upper part of this matrix, because all the rest down here gets zeroed out anyway. But let's take a look at d_j. So d_j is the action minus previous search directions. In the very first step, in particular, it's just the action. So in the very first step, I really only need the upper-left block of this kernel matrix. And it turns out that the search directions inherit this, right? Because in the first step, d_1 is just the action itself, so it also has a bunch of zeros down here; d_1 looks the same. And now if I take d_2, I get s_2, which we just posited only has non-zero entries up here, so it only targets this top block, and I subtract the previous search direction, which also only has non-zero entries in this upper block. And recursively, that stays true. So if we have actions of this form, where we never have non-zero entries down here, we only ever need the entries of the kernel matrix in this upper-left block. So what is the actual complexity? I've already shown it; let me take it away.
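This sparsity argument can be checked directly. The sketch below (a simplified version of the recursion, on a made-up SPD matrix) runs actions supported only on the first k coordinates against the full matrix, and compares with running on the k-by-k top-left block alone; the results agree, and the estimate is exactly zero on the untargeted coordinates.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 3
A = rng.standard_normal((n, n))
K_hat = A @ A.T + n * np.eye(n)               # any SPD "kernel" matrix
y = rng.standard_normal(n)

def solve(K, y, actions):
    # Same recursion as the IterGP loop sketch: v_i and C_i updates.
    v, C = np.zeros(len(y)), np.zeros((len(y), len(y)))
    for s in actions:
        d = s - C @ (K @ s)
        eta = s @ (K @ d)
        v = v + d * (s @ (y - K @ v)) / eta
        C = C + np.outer(d, d) / eta
    return v

# Actions supported only on the first k coordinates ...
S = [np.pad(e, (0, n - k)) for e in np.eye(k)]
v_full = solve(K_hat, y, S)

# ... give the same result as running on the k x k block alone:
v_block = solve(K_hat[:k, :k], y[:k], list(np.eye(k)))
assert np.allclose(v_full[:k], v_block)
assert np.allclose(v_full[k:], 0.0)
```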
What is the actual complexity per iteration for all of these matrix-vector products, if the block size is k? K squared. K squared, yeah. So we can actually decide how much computation we want to expend per iteration by defining a policy whose actions only target certain data points with compute, and the magnitude of how much they're targeted is the magnitude of the corresponding entry. So we can design a method that has arbitrary, choosable computational cost, but always has this error quantification property. And in fact, we are not even limited to the data set that we have available. You can technically work with infinite data, because nobody stops you from just extending this kernel matrix to infinity, loosely speaking; if we never actually target those entries, we never actually need them. So this is just a formal version of this statement: if I choose actions s_i tilde that have a bunch of zeros down here, and I run IterGP on some extended data set X plus X prime, where X prime, you can think of as an infinitely extending set of observations that I could make, it's the same as if I had run IterGP, on the left here, on the small submatrix at the top. So virtually, I can run on a data set of arbitrary size if I wanted to. And that gets us to a slightly philosophical interpretation of this algorithm. Think of what actually happens when we collect data for GP regression. We go out into the world, on the left, and make a bunch of measurements. I collect my buoys out of the ocean to measure the temperature. Then I make notes of that and store them on my computer. But we're in the modern age; nobody hand-writes a prediction or computes a GP posterior by hand. What you do is use your machine to operate on that data to make a prediction.
But it doesn't matter whether the data is on your disk or out in the world: if you haven't targeted it with computation, as we just saw, it's the same as if I operate on some virtually larger, maybe even infinite data set, of which I only target a subset. So really, only data that I compute on is an observation that I make. I can think of this in two ways. Either all of it is data, and whatever I target, the linear projections I make of the data, are the observations; or everything is compute, and even the act of going out into the world and making a measurement is part of my computation to make a prediction, okay? So there's really no hard distinction anymore, except that we have this disk in the middle that I load my data, or part of my data, from. And if you take this view seriously, we can also model the situation with a Gaussian process model, right? We take a prior, as we did all along. But now, instead of observing data point after data point, as you assume if you use a standard Gaussian process regressor, we only observe linear projections of the data through these actions s_i transpose. I target computation at certain data points and not at others, and however much I target them influences my posterior, f given X and y tilde. And with Gaussian processes, you can condition on projected data as well. We saw that with the Gaussian inference theorem, right? That was the finite-dimensional version of this. And it turns out that if you do this, that's precisely the combined posterior that you get. So you can think of IterGP in one way as: we put our probabilistic numerics hats on, we quantify the error that gets introduced through approximation, and we take that into account in our prediction.
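Here is a small numerical illustration of this equivalence, in the batch setting (toy kernel, data, and random actions; everything here is made up for illustration, and the prior mean is taken to be zero). Conditioning the GP on the projected data y tilde = S transpose y gives the same posterior mean as using the inverse approximation built from the action span.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 8, 3
X = rng.uniform(-2, 2, n)
x = rng.uniform(-2, 2, 4)                    # test points
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)
sigma2 = 0.1
K_hat = k(X, X) + sigma2 * np.eye(n)
y = rng.standard_normal(n)
S = rng.standard_normal((n, m))              # m linearly independent actions

# View 1: inverse approximation from the action span (batch form).
C = S @ np.linalg.inv(S.T @ K_hat @ S) @ S.T
mu_approx = k(x, X) @ C @ y

# View 2: exact GP conditioned on the projected data  y_tilde = S^T y,
# whose noise covariance is sigma^2 * S^T S (Gaussian inference theorem).
y_tilde = S.T @ y
G = S.T @ k(X, X) @ S + sigma2 * S.T @ S     # = S^T K_hat S
mu_proj = k(x, X) @ S @ np.linalg.solve(G, y_tilde)

assert np.allclose(mu_approx, mu_proj)
```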
Or you can think of it as just a more accurate model of what's actually going on when you make a prediction with your computer today, because you actually have to load the data into your model, and maybe only part of it. Okay? In some sense, it's just a more accurate model of what's going on in our setting. Now, we've talked a lot about computational and combined uncertainty, and we talked about their interpretation as expected squared errors. Can we make even more precise what it means if we now use the combined posterior as a prediction? From the illustrations, we had the intuition that if our combined posterior is a good error estimate, the true function, black dashed, should lie within the combined posterior, so blue plus green. And if our computational uncertainty is a good error estimate, the black line, the true posterior mean, should lie within the green region, our error estimate for how far the approximate posterior mean is from the true posterior mean. Okay, so let's see whether that's actually true. Of course, that's the way we derived it, so it should be true, but we can make very strong statements about what this actually means. I promised you at the very beginning that I'd show you a way to do exact uncertainty quantification with an approximate model, and this is me fulfilling that promise now. At the top, you have a similar plot, with a different policy in this case, but it doesn't matter: true function black dashed, black line the mathematical posterior mean, blue line the approximation. And the following holds for a Gaussian process: you can interpret the posterior predictive standard deviation, so the posterior covariance function evaluated at the same data point twice, plus the noise, and then the square root of that, as follows.
That is a bound on the relative error between the true latent function, which I don't know and can't evaluate, and the mathematical posterior mean; but it's a relative error, okay? So the uncertainty is, in one sense, a mean squared error, as we saw at the very beginning, but it's also a worst-case bound if we normalize the latent function. And I realize this is just written in words; I'll show the exact mathematical statement after this, but this is the intuition I want you to take away. Now, the thing about IterGP is that it fulfills the same statement, of the same strength, with the mathematical posterior mean, which we don't have access to, replaced by the approximation that you can compute in arbitrary time. Okay. And it is bounded, obviously not by the posterior predictive standard deviation, which we don't have access to because it also involves an inverse, but by this combined standard deviation, which only requires the inverse approximation, which is low rank. So we basically get the exact same guarantee: we can trust our uncertainty quantification exactly the same way we trust the Gaussian process, and we can do it with however much computation we have available. And for those of you who want a precise statement, this is the actual theorem in the paper, so feel free to read up on it; I'll go through it very quickly. The supremum, over all true functions g that could have generated the data, of the difference between g and our approximate posterior mean at a point x, is exactly the square root of the combined variance plus the noise. And this object we can compute, because it does not require K hat inverse. We can also think of the computational error as the distance between the true posterior mean for that latent function g and our approximation, and that is controlled by the computational uncertainty alone.
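Paraphrasing the statement from the lecture in symbols (this is my shorthand, not the theorem verbatim; the precise class of functions g, the normalization, and the conditions are in the IterGP paper):

```latex
% Worst-case bound: over latent functions g compatible with the data and
% suitably normalized, the error of the computable mean mu_i is exactly
\sup_{g}\; \big| g(x) - \mu_i(x) \big| \;=\; \sqrt{\,k_i(x,x) + \sigma^2\,},

% and the computational part of the error, the distance between the exact
% posterior mean for g and our approximation, is controlled by
\big| \mu_*^{\,g}(x) - \mu_i(x) \big| \;\le\; \sqrt{\,k_i^{\mathrm{comp}}(x,x)\,}.
```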
And now, in the image, this gives the justification for why, up here, the black dashed line lies within the blue and green region, right? Let's take one step back: that's the combined standard deviation, up to scaling; I scaled this plot to satisfy that. And the distance between the blue and the black line, the approximate posterior mean and the exact posterior mean, lies within the green region, the computational uncertainty. With that, I'm actually at the end, a little bit early today, but I'll quickly summarize. We can approximate GPs by devising a probabilistic learning algorithm for the representer weights. We can quantify the error of this algorithm, this learning machine, probabilistically, and we can push that error onto the GP posterior that we use to predict. Different variants of this method exist that might converge more quickly or have different properties, but all of them, no matter what policy you choose, as long as your actions are linearly independent, give you exact uncertainty quantification, no matter how much compute you use. And we even looked at the interpretation that the IterGP algorithm is really just a better model for what's going on when we make observations and then use a computer to make a prediction. Okay, with that I'm at the end. Next week, you'll see Jonathan, another Jonathan, talk about the following situation: we always said we only have measurements of the latent function that we're trying to learn, and we're using a Gaussian process model to learn it. Now, what if you actually have some, maybe partially known, physical law that also governs this latent function? Think of the dynamics of a disease, for example. So you know a physical law, in this case an ordinary differential equation, that governs the behavior of the function that you're trying to learn, in addition to observations of it.
We only talked about observations; next week, Jonathan is going to extend that. With that, I'm at the end. I would love it if you gave me feedback, and please don't get up yet, because Professor Hennig also wants to say something afterwards. So take two minutes to give feedback, and then we'll have a quick word from Philipp. Thank you.