Welcome everybody. I'm very happy to pick up where Marvin left off last week. I'm Jonathan, and I have the distinct pleasure of telling you today about how to scale Gaussian processes to large data sets. What we'll see today is that we'll use a lot of the tools that we developed last week. In particular, last week we talked about the textbook way to solve linear systems using the Cholesky decomposition, and we applied that to GP regression. Today, to give you context for where we are in the course, we're essentially developing the modern way to solve such linear systems with kernel matrices and applying that to scale Gaussian processes. And then next week, we'll give you the probabilistic numerics interpretation: how do I learn about the machines that I use to build learning machines? So next week will be the learning-machines view; today is kind of the preparation for that. I'll already start to use language that interprets these numerical methods as learning methods as well. So today: GPs on large data sets. We'll do that by using iterative methods to solve the linear systems. We'll see, in particular, that these methods can be viewed as learning algorithms for the matrix inverse, which is the primary object that we need to compute the GP posterior. And then we'll develop a method to do quadratic-time, as opposed to cubic-time, inference for GPs. And finally, at the very end, I'll give you a hint that you can actually do this faster, in linear time in the size of the data set — but that might come with some drawbacks, and we'll talk more about that next lecture. OK, so now that we've situated ourselves: please, during the lecture, at any point, if you don't understand anything, just ask me. I'll do my best to answer. And with that, let's get into it. Quick recap about Gaussian processes; we talked about them last week a little bit. Why do we focus on them as a base case so much? The main goal for regression is that we are trying to learn an unknown function, in this case f star, from a training data set of examples. That's the basic example of regression. And what do we want from our model if we do regression? Ideally, maybe the most important thing: we want it to generalize to data that we do not have in our training data set. But there might be some other goals as well, alongside this generalization goal. Second of all, we might want our model to be sufficiently simple or interpretable that we can actually understand, after training, what's going on — what the model has actually learned. Third, we really would also like to know how much to trust the prediction. If we just get a point estimate, we might not actually be sure how much we should trust that point estimate — is it a good prediction far away from the data? And then finally, we really want this to be fast. It doesn't help us if our model trains for, let's say, 300 days; then we can't use it well in practice. And GPs are really the prime example where we can hit all of these points. So, of course, you want generalization; that's just what we need for regression. GPs are in some way simple and interpretable, because we can reason about what assumptions about the data the kernel encodes. We get an uncertainty estimate that tells us how much we should trust the prediction, assuming our model is correct. And then finally, as we learned last week, it's actually very expensive to train them and do inference. So what do we do about that last point?
And that's what we'll deal with today. All right, just as a reminder, this is what the Gaussian process posterior looks like. In the mean, we have an error term between the data and our prior mean, our prior prediction, that gets scaled by the inverse of the Gram matrix. You can think of that as: how strictly should I interpret this error? If my prior uncertainty is very large, maybe it's fine if I'm far away from the true data, right? But if my uncertainty is actually very concentrated, I should make a big update to be closer to the training data. And then the covariance evaluated at data points also involves this inversion of a matrix of the size of the data set. Now really, GP regression is two parts, right? It's not just computing the posterior; we also need to find the kernel hyperparameters, so do model selection in a way. This would be your training loop in deep learning, for example. And the loss that we optimize for GPs is the log marginal likelihood: given this kernel hyperparameter, or these kernel hyperparameters, how likely is it that the data were generated from this model? And you can see that we can underfit — for example, here the length scale of the model is actually much wider than the true length scale that generated the data in black from the true function in gray dashed — versus the opposite end of the spectrum here, where the length scale of the kernel, so the distance within which we assume data points to be related, is much too short, and we kind of overfit. As we move away from a training data point, our model very quickly says we don't actually know anything about where the true function lies. Okay, so those are the two tasks. And last week we talked about the Cholesky decomposition. So just as a reminder: for GP regression, what we need is linear solves with this Gramian of the size of the data set, the kernel matrix plus noise. And we need to do that both for the posterior mean and for the covariance, depending on how many test points we evaluate at. And then for model selection, we need the log determinant — we just saw that right here in the complexity term, kind of a regularization — and we also need its gradient if you do gradient-based hyperparameter optimization to find the hyperparameters. And last week we developed that you can use a Cholesky decomposition for this, right? Isn't that great, problem solved? It does everything: computes the log determinant, computes linear solves, perfect. Here's an illustration of that. On the left we have a kernel matrix — I believe this is a Matérn-3/2 kernel — and you get these Cholesky factors, L and L transpose. Now, just as a quick aside, kernel matrices don't always look this nicely ordered. This is a one-dimensional data set, where you can order the data from smallest value to largest value, but in general, for high-dimensional data sets, or if your data set is just not ordered, the kernel matrix doesn't necessarily look this structured, because the correlation between two data points depends on their distance, and two data points that are ordered next to each other don't have to be proximal in input space.
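To make that concrete, here's a minimal numpy/scipy sketch of the textbook pipeline we just recapped — zero prior mean assumed, kernel matrices taken as given, and the function name is my own:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior_cholesky(K, k_star, k_starstar, y, sigma2):
    """Textbook GP regression via a full Cholesky factorization.

    K          : (n, n) kernel matrix at the training inputs
    k_star     : (n, m) kernel between training and test inputs
    k_starstar : (m, m) kernel at the test inputs
    y          : (n,)   training targets (zero prior mean assumed)
    sigma2     : observation noise variance
    """
    K_hat = K + sigma2 * np.eye(len(y))   # Gramian K + sigma^2 I
    L = cho_factor(K_hat, lower=True)     # O(n^3): the bottleneck
    alpha = cho_solve(L, y)               # representer weights
    mean = k_star.T @ alpha               # posterior mean
    v = cho_solve(L, k_star)
    cov = k_starstar - k_star.T @ v       # posterior covariance
    # log marginal likelihood: fit term + complexity (log-det) term + constant
    logdet = 2.0 * np.sum(np.log(np.diag(L[0])))
    lml = -0.5 * y @ alpha - 0.5 * logdet - 0.5 * len(y) * np.log(2 * np.pi)
    return mean, cov, lml
```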
Okay, but here's a big issue with the Cholesky decomposition, and you might have seen it in the homework already. I asked you to use your GP implementation to train on a large data set, 100,000 data points, and you should have found this, depending on your system: you're not going to be able to even form that kernel matrix in memory, because it takes about 80 gigabytes. Maybe you have a very, very big machine with a lot of RAM, say 80 gigabytes of RAM — then you'll still face the next problem. A modern CPU does about one gigaFLOP per second, so 10 to the 9 floating point operations per second. If we plug that into the number of operations we need for the Cholesky decomposition, one third times n cubed, n being the size of the data set, it'll take you about 83 hours, give or take some overhead, okay? That's kind of not acceptable. Later in the lecture, I'll show you what I did this morning: you can do this in seven and a half minutes to desirable accuracy. Okay, so clearly the Cholesky decomposition is prohibitive, both in time and space, for large data sets that we might care about. What can we do instead? We'll talk about iterative methods today, and we'll try to reduce the complexity to quadratic in the number of data points. You've already seen this as well: this is again the Cholesky decomposition, and now I only show a couple of columns of the Cholesky factor, which gives me a low-rank approximation of the kernel matrix. If I do i iterations of the Cholesky in its iterative form, that's a rank-i matrix. But that's not really what we need for GP regression, right? We need an approximation to the inverse, not to the kernel matrix itself. So is there a way that we can also interpret the iterative formulation of the Cholesky, where we stop it early, as an approximation to the precision matrix — that's another name for the inverse of the kernel matrix? Can we approximate the linear solves here — oh, down here — with the inverse Gramian, which I plotted here, by a matrix that we'll call C_i, such that if I apply C_i to some vector, this is close to the linear solve with the kernel matrix? All right, let's take a closer look at this iterative form of the Cholesky decomposition on the left here that we've seen last time. So just as a reminder: we put in a symmetric positive definite matrix A — in our case, that's just the kernel matrix plus noise — and we get out a lower triangular factor that, if we stop after i iterations, only has i columns that are non-zero. And then we get an approximation to the kernel matrix. So what happens here? There are really only two things going on. Down here, A prime is sort of the residual, the part of the kernel matrix that we haven't learned yet. You can see that by reformulating it as just A minus the current approximation of the matrix, L_i times L_i transpose. So this is a matrix residual, let's call it. And then there's a second thing going on: as Marvin developed last week, this step here you can interpret as multiplying this residual with a vector — in this case, a unit vector, up to scaling, that selects a data point of the kernel matrix, so it just selects a column of the matrix. And depending on how you choose these, you might get a worse or better approximation after i iterations, i less than n.
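A minimal sketch of that iterative form in numpy (the function name and the default pivot order of 0, 1, 2, ... are my choices; A must be symmetric positive definite):

```python
import numpy as np

def partial_cholesky(A, num_iters, pivots=None):
    """Run the iterative Cholesky for `num_iters` steps, so that
    L @ L.T is a rank-`num_iters` approximation of A."""
    n = A.shape[0]
    pivots = np.arange(num_iters) if pivots is None else pivots
    L = np.zeros((n, num_iters))
    A_res = A.copy()                    # A', the matrix residual
    for i in range(num_iters):
        j = pivots[i]                   # which data point / column to select
        col = A_res[:, j]               # A' e_j: action is a scaled unit vector
        L[:, i] = col / np.sqrt(col[j])
        A_res -= np.outer(L[:, i], L[:, i])  # subtract what we just learned
    return L
```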
Okay, now can we rewrite this algorithm, tracking a few more quantities, to also get an inverse approximation out of it? First, our goal is to find such a C_i, an inverse approximation, and we know we have this matrix approximation, L_i times L_i transpose. Can we somehow wrangle our algorithm to also get an inverse approximation? What we could do, for example: if I take this equation and multiply from the left and right with A inverse, then obviously on the right side I have A inverse, and on the left side I have my Cholesky factors, multiplied from the left and right with A inverse. Now that didn't really get me much, because I'm trying to invert the matrix A — so to get an approximation for the inverse, I'd need the inverse itself, which doesn't really solve the problem. But let's see what the entries of this matrix look like. Clearly, C_i is actually a low-rank matrix, right? Because A inverse L_i and its transpose only have i columns. And the last column is just A inverse times this l_i vector that we constructed in step seven and put into the next column of the Cholesky factor, okay? So if I plug in the formula for l_i, which is A prime times this scaled unit vector — remember, A prime is the residual, the thing that we haven't learned about A yet — and I plug in my definition for the residual, that's the matrix itself minus the part that I've learned up to the previous iteration, and then I multiply A inverse into this, I get an identity matrix here — that's what you can see here — and I get A inverse times L_{i-1}. And then I can also introduce an identity matrix on the right-hand side by adding A inverse times A, okay? And now what I have in here is just A inverse times L_{i-1}, times A inverse times L_{i-1} transposed. But that's just the inverse approximation from the previous iteration. So now I've reduced this last column of the Cholesky factor to something that I computed in the previous iteration. So I can recursively also construct the inverse approximation, and that inverse approximation is also low-rank, in the same way that I construct the low-rank approximation to the matrix A, okay? So really, the Cholesky is actually also an approximate matrix inversion algorithm, okay? So now I've added some terms into this Cholesky to show what we need to track to compute this inverse approximation as well. First of all, we select a unit vector — which corresponds to selecting a data point, if A is a kernel matrix — and let's call that an action. Then I do a bunch of stuff; let's focus on the bottom part first. You've seen that here we update the matrix residual, and now I've found a way to also compute this approximation to the inverse, C_i, okay? And by definition, we said C_i was A inverse L_i times (A inverse L_i) transpose. I can rewrite this update also in terms of C_i, and what we'll find is that if I pull A out of these parentheses, I get A times (A inverse minus C_i) times A. So in here is the residual for the inverse matrix, in the same way that out here I have A minus some low-rank approximation: in here, I have A inverse minus some low-rank approximation of the inverse. So hidden in here is this residual. And then, if I multiply A in from the right, I get this I minus C_i A factor, and that pops up again up here in the algorithm. So I modify my action a little bit with this factor here, and then I multiply this modified vector d_i with the matrix A. And if I do that, what I essentially do is take this residual matrix, right? A times this factor times the unit vector e_i — that's the observation that you previously did in the Cholesky. Let's take a step back: that's this, right? Residual times e_i. Question? Sure. So for us, the residual is the thing that we care about minus our approximation, okay? All right, so we have a way to modify our action to get some vector d_i, and it turns out that the inverse approximation is just: at each iteration, I add an outer product of these vectors d_i. And we see that down here as well, right? This (I minus C_{i-1} A) times e_i just gets replaced by the vector d_i.
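Written out — this is my compact summary of the slide, with s_i the i-th action (a unit vector e_i for the pivoted Cholesky) and C_0 = 0 — the recursion is:

```latex
\begin{aligned}
d_i    &= (I - C_{i-1} A)\, s_i  &&\text{corrected action,}\\
\eta_i &= s_i^\top A\, d_i       &&\text{normalization constant,}\\
C_i    &= C_{i-1} + \tfrac{1}{\eta_i}\, d_i d_i^\top
        = \textstyle\sum_{j \le i} \tfrac{1}{\eta_j}\, d_j d_j^\top
                                 &&\text{low-rank inverse approximation.}
\end{aligned}
```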
And let's do a quick complexity analysis: is that actually more expensive than doing the Cholesky by itself? We know that the Cholesky per iteration takes O(n squared), so let's take a look at the computations that we added. Okay, so here we have a matrix-vector multiplication and another matrix-vector multiplication — that's still O(n squared). In this eta_i normalization-constant step, we also just do a matrix-vector multiplication. And then down here, this is a low-rank matrix, so we can actually just track the low-rank factors by storing the d_i's; and even if we multiply this out with an outer product, that's also just O(n squared), okay? So per iteration, you just pay n squared to do this. Asymptotically, it's just as expensive as if we only approximated the matrix itself. All right, let's take a look at what this looks like applied to Gaussian process regression. So remember, for just computing the posterior, we need these two matrix inverses of the kernel matrix, size n by n. Now we have an approximation to this inverse. So what happens if we just replace that with this inverse approximation? We get this. This is three iterations of the partial Cholesky — partial because we don't run it to convergence — and the data points that we selected, via the unit vectors for our actions, are the data points in blue here: in sequence from left to right, the first three. Now clearly, not the greatest approximation, right? Basically back here, we haven't learned anything about the true function in gray dashed. Now, you might notice that this kind of looks like GP regression if you only did GP regression on the first three data points, right? And in fact, doing the partial Cholesky with no noise is equivalent to GP regression on the data points that your pivoting strategy selects, so your unit vectors. And if you add noise, it's almost the same, not quite, okay? So you can think of the partial Cholesky as something like doing GP regression, but conditioning only on a few data points.
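Here's a minimal sketch of that modified algorithm, directly implementing the recursion above. For clarity, I accumulate C densely, which costs O(n^2) memory; storing the d_i vectors instead keeps it low-rank. Again, the names are mine, not from any particular library:

```python
import numpy as np

def partial_cholesky_inverse(A, actions):
    """Track the low-rank inverse approximation C_i ≈ A^{-1} alongside
    the partial Cholesky. `actions` is a list of vectors s_i; unit
    vectors recover the pivoted Cholesky, but any vector is allowed at
    the same O(n^2) cost per iteration."""
    C = np.zeros_like(A)
    for s in actions:
        d = s - C @ (A @ s)         # corrected action d_i = (I - C A) s_i
        eta = s @ (A @ d)           # normalization constant eta_i = s^T A d_i
        C += np.outer(d, d) / eta   # rank-one update of the inverse estimate
    return C

# usage sketch: approximate representer weights and posterior mean
# alpha = C @ y;  mean = k_star.T @ alpha
```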
So last time we already saw that it really matters what pivots you choose — or, in other language, what unit vectors you choose, what data points you select; those are all different names for the same thing. And it turns out that here, that's also the case. So on the left — I think this is 10 iterations of the partial Cholesky — as unit vectors, as actions, I chose those that select the first 10 data points on the left here. And you can see that in this region we really have learned the true posterior mean in black quite well, while all the way over here we essentially have learned nothing about the true posterior that we're trying to approximate. Now on the right, I just chose a random order for the actions, so a random selection of data points that I consider, a random selection of columns of the kernel matrix that I look at in the partial Cholesky. And you can see that this looks much better, right? The posterior is much closer to the true posterior in dashed black lines. And also, we can imagine that we've learned quite a lot more about the function, and we're much more certain about where it is. So this choice of data points not only matters for approximating the kernel matrix; in particular, it also matters if you approximate the inverse, and thus for the GP approximation, okay? So you can think of this selection of actions, the selection of unit vectors for the Cholesky, as a kind of act of learning — in the sense that if I had a strategy that, while I do this, at each iteration picks better data points to condition on, I could actually learn much faster, as is exemplified here by the random order. But you could also think of a different strategy that selects data points in a smarter way, and we'll see that later, okay? So the selection of data points really matters for how quickly you converge to this inverse approximation, and thus to the GP posterior. So how do we do this smarter? Can we find better actions? On the left, there's another illustration of this. Inside the partial Cholesky, there's a step where we take the residual A prime — the part of the matrix that we haven't learned yet — which we can rewrite via this part where we correct our action with the inverse approximation from the previous step. And it looks like this. This is the residual after ten iterations of the partial Cholesky, and every time, I chose a unit vector that selects one of these data points. You can see that most of the data points here have been selected, the ones here and the ones down here — I think those are maybe 30 data points or something like that. So in each step, I choose a data point, and the residual of the matrix looks like this. But why do I have to restrict myself to a vector that looks like this, right? The computational cost per iteration is quadratic in the number of data points anyway, so I can afford a matrix-vector multiplication here. I can choose any vector; I don't have to select a column. So what if we found a better vector, with different entries, maybe even more than one non-zero entry, to more quickly approximate the kernel matrix — or the inverse, which, we just learned, is kind of the same? And here I chose some strategy that I'm going to tell you about in a moment, and clearly, just visually, you can see that it's closer to the true matrix, the true kernel matrix. So what's a better, more efficient way to choose these actions? Okay, so this is the second part of the talk, and we'll talk about a method called the method of conjugate gradients, which was originally designed to solve linear systems — as they pop up, for example, in GP regression — with symmetric positive definite matrices. Kernel matrices are symmetric positive definite, so that's great. And I'll dive a little bit into the classic view of how this method works, and then I'll pull you back into this view of actually learning about the matrix, or learning about the inverse, as we've already seen. Okay, so: method of conjugate gradients. Our goal is to solve a linear system, A x equals b — A, for our purposes, you can always just think of as a kernel matrix — and we want to do that with, ideally, very few matrix-vector multiplies, so just a few operations quadratic in the size of the data set, to quickly approximate the posterior. Now, a different way of thinking about this problem, A x equals b, where A is a matrix and b is a vector of size n, is to rephrase it as a quadratic optimization problem, and let me show you how that works. Take this quadratic function, where the argument x is a vector of size n. Setting its gradient to zero — so optimizing this quadratic, which is a convex optimization problem, so there's a global minimum — is actually equivalent to solving the linear system A x equals b. Okay, so the gradient of this is just A x minus b, and setting it to zero is equivalent to A x equals b. And now I can also define another residual here, which is b minus A x: if I plug in something for x, that tells me how well I've solved the system, or equivalently, it's my current negative gradient, okay? So it points in the direction of steepest descent.
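In symbols, the reformulation is:

```latex
f(x) = \tfrac{1}{2}\, x^\top A x - b^\top x, \qquad
\nabla f(x) = A x - b, \qquad
r(x) := b - A x = -\nabla f(x),
```

so, since A is symmetric positive definite and f therefore convex, the unique minimizer x star of f satisfies A x star = b.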
So what's your favorite optimization algorithm? Just shout it out. Nobody has a favorite optimization algorithm? All right, so what if we just throw gradient descent at this problem, right? Why not? Gradient descent is super easy. We can compute the gradient; it's just b minus A x for the current position x. Why don't we do gradient descent? All right, so here on the right in blue, these contour lines are the equipotential lines, at which the function has the same value. So the blue contour lines are the function f of x. The minimum, so the solution, is here: x star, which is equivalently the minimizer of this function and, as we just learned, the solution to this linear system. So if I do gradient descent, I follow the direction of steepest descent. What I get is the green line here: I start at x zero here, and the direction, the green line, is orthogonal to the equipotential lines. I take one step, and I can actually minimize the one-dimensional quadratic along this direction in closed form. So I land here, right? If I stepped any further, I would actually increase my function value again — so that's great. And then I do this a bunch of times: one, two, three, four, five, maybe six times, six steps. Fine, right? So one observation here: these gradient steps for this quadratic are always orthogonal to each other in the Euclidean inner product, okay? Can I do any better? Anybody have any ideas? What if I slightly modify the directions that I walk in, not to be orthogonal with respect to the Euclidean inner product — the inner product that induces the two-norm — but orthogonal with respect to the A-norm, where A is our matrix, okay? What does that look like? So here in red, we have one example of this. Take one step, and then the next step is orthogonal in the geometry that's induced by these blue lines. So what does that mean? Imagine you could squish this landscape, this blue landscape, such that the blue lines are actually circles, okay? In this squished space, what the red path does is take right-angle steps, okay? Now of course, we don't have this squished space. We have this elongated space.
So in this elongated space, taking orthogonal steps actually seems to be a fine idea, but it takes six steps — versus if I take right-angle steps in the squished space, it seems to take just two steps. And that's actually a general result that you can show: if I follow such conjugate directions — conjugate being another word for orthogonal in an arbitrary inner product, here the one induced by A — I converge in at most n steps. And now let's do a slight modification of this, because for the first step, we haven't decided yet where we should go, right? We can go in any direction. Why don't we just take our negative gradient as a first step, walk in the direction of steepest descent, and then modify our steps to always satisfy this conjugacy condition — for which we then know we converge, for this two-dimensional problem, in exactly two steps, in exact arithmetic, okay? So that's the idea, and this is what the algorithm looks like. Don't get too intimidated by all the symbols; I just want you to get the gist of what the individual steps do in this graphic, okay? So I'm just going to explain the high-level parts, and it's not super important to know exactly how to derive these steps; we'll do that next lecture in a more general form. So my algorithm gets a matrix A, a vector b, and an initial guess for the solution, x zero. I have some stopping criterion, for example based on the gradient norm — if that's zero, we're at the solution. In the beginning, I compute the residual, which, we just learned, is the negative gradient, right? So I compute the steepest descent direction. Then, to actually figure out in which direction I should walk — called the search direction — I take my negative gradient, the residual r_{i-1}, and then I do some correction, and that correction, in this case, depends on the previous step. So that's the intuition: when I take a gradient step, in green here, I can't just take the next gradient step; I have to slightly correct it for what I've done previously, okay? So that's this correction part here, this minus, depending on the previous search direction. And then finally, based on the search direction, I can update my estimate of the solution with some step length, and the step length is the exact minimizer along this direction. And you can see here, by the equipotential lines, that if I walked farther, I would actually increase the function value again, okay? So this is the minimum along this direction. All right, so this is the method of conjugate gradients, great.
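As a concrete reference, here's the standard loop in the same numpy sketch style as before (no preconditioning yet, and no claims to numerical robustness):

```python
import numpy as np

def conjugate_gradients(A, b, x0=None, tol=1e-10, max_iter=None):
    """Plain CG for symmetric positive definite A (e.g. K + sigma^2 I).
    Only needs matrix-vector products with A, so each iteration is O(n^2)."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                        # residual = negative gradient
    d = r.copy()                         # first step: steepest descent
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:      # stopping criterion on gradient norm
            break
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)       # exact minimizer along d
        x = x + alpha * d                # update solution estimate
        r_new = r - alpha * Ad           # new residual
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d             # conjugacy correction of the gradient
        r = r_new
    return x
```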
But remember, we're looking for a method that approximates the inverse, because we don't only have to solve one linear system for the mean, but also multiple ones for the posterior covariance, right? Because the right-hand sides there are the kernel evaluated at the training data and the new data points. Okay, so can we do the same thing, or a similar thing, as we did for the partial Cholesky, by interpreting this CG method as one that also constructs an approximation to the inverse? And the answer is yes, of course — otherwise I wouldn't be telling you. So now we also want a matrix C_i that approximates the inverse from this method, and we'll see that this is also low-rank, in the same way that it was low-rank for the partial Cholesky. In black, I just carried over the steps and maybe slightly reformulated them. So let's take a look at that. So here — sorry, in the beginning I compute the residual again, same thing, and then I just call that residual my action, okay, s_i. And now, correcting this residual, as we did here, turns out we can write as just the residual itself minus a correction that depends on the previous inverse approximation, okay? So this term here that looks very complicated, we can rewrite with the same structure as we had for the partial Cholesky. That's the conjugacy correction. And it turns out that these search directions, if we take their outer products, up to scaling, and sum them up, give exactly this inverse approximation. And to give you a very high-level intuition of why that works: take this form of the approximation of the inverse and multiply it with the right-hand side vector b, up here. If this inverse approximation is consistent with my solution estimate x_i, it should be the case that multiplying C_i with b gives me x_i — otherwise, I'm actually developing two different estimates for the solution of this linear system. So it only makes sense if C_i times b is actually equal to x_i, because C_i is an approximation to the inverse. Now, we know that the update for the solution estimate involves this conjugate direction d_i, right? That's the update, up to scaling. So this matrix C_i here, if it is to make sense and is low-rank, has to have the form: all previous search directions stacked into columns, times all these vectors stacked into rows, because then I get a linear combination of these search directions d_i as updates here. And that's exactly the form that we see here, right? We get one d_i times d_i transposed per iteration.
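In formulas (my summary; for x_0 = 0 and in exact arithmetic):

```latex
C_i = \sum_{j=1}^{i} \frac{d_j d_j^\top}{d_j^\top A\, d_j},
\qquad\text{and consistency holds:}\qquad
C_i\, b = x_i .
```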
So I realize that that's a lot to take in, but it's actually not so important to understand where this comes from, because we're doing that next week. What I want you to take away is this intuition: I can select actions that actually solve this linear system very quickly, and we'll see that in a moment. Quick comparison to the partial Cholesky that we saw before — you'll recognize a bunch of things now, right? Here we selected a unit vector as an action; here we're selecting the residual, or negative gradient. Here we're slightly modifying our action, and we do the same thing over here for the residual to get the search direction. We're estimating an inverse here with a low-rank, actually rank-one, update — same thing over here. Now of course, for the Cholesky we didn't actually solve a specific linear system, so we didn't have a step where we estimate a solution — but we could compute the solution there as well, by doing the same thing: multiplying the inverse approximation with the right-hand side, okay? So these algorithms are actually very similar; we just typically understand them from very different angles. Okay, so let's compare them for GP inference. We just again replace the kernel matrix inverse with this C_i, the inverse approximation. On the left here, we have the partial Cholesky for ten iterations with randomly sampled data points. You can see that where we pick data points, the uncertainty is contracted. And here is conjugate gradients, where we don't even have to specify which data points to pick, because of this residual action strategy; it is actually much more contracted, and it matches the true posterior, in black dashed lines, and the true posterior mean, in black, much better. The red is our approximation; the black is what we want. So this intuitively shows that the method of conjugate gradients seems to converge much, much faster. And what is the interpretation of these actions? You can think of the actions as probing the matrix that we care about, the one we want to invert, in a certain way — and however you structure that action, it probes the matrix in a different way. So here, for this GP approximation, we probe at this point, then at this point, then at this point, and then at this point, and the GP approximation contracts around them. For CG, our residual vector in general probes where our estimate of the posterior mean is still far away from the data. So it probes globally, at every single data point, scaled by how far away I am from the data. That's exactly what the residual is: if you take a look back here, for GPs, b is the data y, and the residual is the data minus the kernel matrix times the current representer weights. So that's the intuition for why this works so well for GPs. Notice that I'm already using terms that you might know from other machine learning courses — action, in the sense of deciding where to evaluate or where to observe; learning about the inverse matrix — and we'll see a lot more of this language later. Okay, so — apologies, there we go. So intuitively, based on this picture, it seems like the method of conjugate gradients converges much faster. But how fast does it actually converge? Now I have to take a quick numerics interlude to analyze this. For those of you that haven't been in contact with numerics too much, here's a quick slide with the three most important concepts from your numerics lecture — I hope I'm not stepping on anybody's toes here. So the first one is machine precision. Whenever we're working on a computer, and not just in abstract mathematics, there's unavoidable rounding error, because I can only store my numbers on my machine up to a certain precision. I call that here the floating-point rounding FL of some theoretical quantity x, which I can store on my machine as x tilde. Now, whenever I have such a distortion on my disk and I apply some operation to it — for example, I solve a linear system, and my matrix and my right-hand side, as stored on my computer, are suddenly slightly distorted — then even if I apply the exact algorithm for inversion, the theoretical mathematical operation that inverts the matrix, I still have unavoidable error, because my inputs are slightly distorted, so my outputs also get distorted: this bigger white circle here. And one way to measure this unavoidable error amplification through the mathematical operation that I'm trying to implement in an algorithm — for the case of solving linear systems — is called the condition number. Technically, it's the unavoidable relative error amplification, but you can think of it as just unavoidable error amplification. And it turns out to be the largest eigenvalue divided by the smallest eigenvalue, in absolute terms.
Okay, so if you think of the matrix as representing a quadratic function, as we had just a few slides ago — here, the blue lines — this is how elongated the quadratic is in its largest elongation direction divided by its smallest elongation direction. Sorry, the other way around geometrically: smallest divided by largest — but in eigenvalues, it's the largest divided by the smallest. Okay, and then there's a second concept, and this is just as a service. What happens if I actually try to implement this theoretical mathematical operation in an algorithm, like the conjugate gradient method or the Cholesky? Then there's additional error coming from the implementation, maybe from rounding in between, or from just not being true to the exact mathematical operation. And if my error is then not too large — not too far away from this unavoidable amplification — I call that stability, okay? Now, for example, the Cholesky is a stable algorithm. So if you solve a linear system on your computer with a Cholesky decomposition, the error you should expect is on the order of the condition number, depending on what norm you measure in, okay? So if that's really large, no algorithm in the world is going to be able to solve that linear system to machine precision. Machine precision in double precision is roughly 10 to the minus 16, okay? So if my condition number is 10 to the 16, good luck finding an algorithm that's going to solve that linear system to some desired precision smaller than one, okay? All right, so now we can analyze the convergence behavior of the method of conjugate gradients. Here on the left, I have a system with a kernel matrix over a thousand data points, with a moderately large condition number, 10 to the 5. On the x-axis, I have the number of iterations I run CG for, and on the y-axis I have the error in A-norm — so that's the true solution of the linear system minus my approximation, measured in A-norm. First of all, let's observe that we can't actually reach double machine precision, at roughly 10 to the minus 16, because our condition number is pretty large. Now, it's not exactly that we can just add five orders of magnitude to this, because technically that statement is in two-norm and this plot is in A-norm — but certainly we cannot reach 10 to the minus 16 here. Now, it turns out that for the method of conjugate gradients, the convergence rate — how quickly it converges to the true solution — is governed by the condition number of the matrix A, the largest divided by the smallest eigenvalue, through a term raised to the power i. So if the condition number of A is very close to one, does this term decrease fast with i, or slowly? What is the numerator if kappa of A is close to one? Just shout it out. Close to zero, okay? What about the denominator? Close to two. So this fraction is almost what? And if I now take that to a power, does it decrease fast or slowly — compared to if, let's say, kappa of A is a million? What is the fraction then? Roughly, almost one. So which converges faster, kappa a million or kappa one? One, right? Because then this thing here is almost zero, so the error decreases really fast, okay? And if kappa is a million, the fraction is almost one, and it hardly decreases at all. So the closer the fraction is to one, the slower we converge. So here, I should actually not expect such fast convergence, right? The condition number is quite large.
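For reference, the standard convergence bound for CG is:

```latex
\|x_* - x_i\|_A \;\le\; 2 \left( \frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1} \right)^{\!i} \|x_* - x_0\|_A,
\qquad
\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)} ,
```

which is exactly the fraction we just discussed: for kappa near one it is near zero, and for huge kappa it is near one.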
It turns out it still converges quite okay: a few hundred iterations, relative to the problem size of a thousand — that's not too bad. By the way, I only chose such a small problem here because I actually need to compute the true solution to evaluate this, right? I need to be able to compute a Cholesky to get a solution that's within unavoidable error. So we can see that CG converges fast if the condition number of my system is small. But does that mean it converges fast for all cases that we care about — for example, for kernel matrices? Here's a similar example: 1000 data points, A is a kernel matrix, and I vary the damping, the observation noise of my model — I decrease it from red to blue. So I assume less and less noise in my observations, and you can see that the convergence of CG actually slows down as I decrease the observation noise. And the reason is — let's take a look. Here's the kernel matrix K plus sigma squared I; K is an SPD matrix, or at least symmetric positive semi-definite, because it's a kernel matrix. So we can do an eigendecomposition, where Q holds the eigenvectors and Lambda is a diagonal matrix with the eigenvalues on the diagonal. Now, we can find an eigendecomposition where Q has orthogonal — actually orthonormal — eigenvectors, so I can introduce an identity matrix here: Q times Q transpose is the identity. And I can pull the sigma squared I inside to find the eigenvalues of K plus sigma squared I: they are the eigenvalues of K, with sigma squared added to each of them. So if I want to compute the condition number of this damped kernel matrix, I just need the largest eigenvalue of the kernel matrix without damping, plus the damping, divided by the smallest eigenvalue plus the damping, okay? And now you can see what happens if I take sigma squared to be very, very small. If I take sigma squared to zero, what does this term converge to? Just shout it out. Just the condition number of K, right? Lambda max of K divided by lambda min of K. It turns out that for most kernels, these are very, very different numbers — such matrices are very, very badly conditioned, so lambda max is much larger than lambda min. If I increase sigma squared, the opposite happens, right? In the extreme case, if I let it go to infinity, this converges to one — but obviously then I'm assuming very large observation noise, so I don't learn anything about my problem. So there's a trade-off going on: the more noise I assume, the easier my numerical problem gets, but the less I actually learn about the function that I'm trying to learn. And to give you an intuition for why this condition number of K is large: the condition number is the largest eigenvalue divided by the smallest eigenvalue. Now, if I have two data points in my data set that are the same, two rows in my kernel matrix are the same, right? Because a row of a kernel matrix is the kernel evaluated at one data point against all the other data points. So if two data points are the same, I have two identical rows, and obviously that matrix then has low rank — sorry, rank less than n. And if it has rank less than n, then an eigenvalue has to be zero. So as data points get closer to each other, the condition number of a kernel matrix blows up — and "close to each other" is relative to the kernel length scale. So if the observation noise is small, sigma squared very close to zero, say, and I have close data points in input space, I have a problem, because then CG converges slowly.
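You can check this behavior in a few lines (a toy Matérn-3/2 kernel matrix; the length scale of 0.2 and all other numbers here are just illustrative choices of mine):

```python
import numpy as np

# How the damping sigma^2 affects the conditioning of K + sigma^2 I.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(500, 1))
r = np.abs(x - x.T) / 0.2                                  # distances / length scale
K = (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)   # Matérn-3/2 kernel

for sigma2 in [1e-1, 1e-2, 1e-4, 1e-8]:
    eigs = np.linalg.eigvalsh(K + sigma2 * np.eye(len(x)))  # ascending order
    print(f"sigma^2 = {sigma2:.0e} -> condition number = {eigs[-1] / eigs[0]:.2e}")
```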
So what can we do about this? The answer is preconditioning. Assume we have a computationally tractable approximation P to the matrix A. So we know something about this matrix — in this case, the kernel matrix. We assume that storing P is rather cheap relative to storing the matrix A, that we can efficiently evaluate its inverse (or pseudo-inverse), and maybe we even know its determinant. So we know a lot about this matrix that is approximately A. And the idea is: we solve an equivalent linear system, where we just multiply from the left with P inverse. This system still has the same solution x, right? I haven't changed anything, except multiplying from the left — the equation still holds, so x is still a solution of the linear system with P inverse A as the matrix and P inverse b as the right-hand side. And now, if I can choose my preconditioner such that this condition number is much, much smaller than the condition number of the original matrix, I've gained a lot, right? Because I've replaced my original problem with an easier-to-solve problem, where my method, CG, converges much faster. And the intuition you should have about the preconditioner is that it's like prior knowledge about my problem: I know something about the system I'm trying to solve that makes it easier for me to solve — in the same way that, if I take a probabilistic learning method and my prior is very concentrated around the true quantity that I'm trying to infer, I can learn very quickly, or I don't have to collect a lot of data. Preconditioning is the same intuition. And geometrically, what preconditioning does — this is a 2D linear system again, with its interpretation as optimization of a quadratic, where the blue lines are the contour lines, and I have a large eigenvalue and a small eigenvalue, which is why it's elongated. Now, if my preconditioner were the inverse itself, it would immediately snap the matrix A to an identity matrix, and I wouldn't have to do anything, because it just says x equals P inverse b, and I have the solution, right? So P transforms my problem in such a way that it's somewhat closer to the identity matrix, and then it's easy to solve. And it turns out that, because of this property here — the condition number of the system getting lowered — preconditioning actually accelerates CG. And in finite precision, it actually also makes it more stable, but we won't go into that. Okay. So let's try to get an intuition for how we can use this form of prior information — and I use that term loosely; next time we'll make it precise — to solve linear systems quickly for GP regression. Okay. If I now use CG with a preconditioner, it's actually very simple: the only thing I have to change is that I don't walk in the direction of steepest descent, r_{i-1}, the residual or negative gradient, but I slightly warp that direction based on the preconditioner. And what happens is — here again is the number of iterations of CG, in log space, and the error in A-norm to the true solution.
As I increase the quality of my preconditioner here — I'll explain the type of preconditioner in a moment — you can see that I actually converge faster, right? The error decreases, and this is in log space, so I need orders of magnitude fewer iterations for this problem, which is again a kernel matrix. How do we get such a preconditioner? Well, we can actually use the partial Cholesky that we already introduced. So one way to precondition kernel matrices is to approximate the kernel matrix K by a low-rank approximation of rank l — I just run the partial Cholesky for a few iterations — and add the noise. And I can store this thing as just l vectors of size n, so it's O(nl), which is cheap compared to the size of the matrix. And this is a low-rank-plus-diagonal preconditioner, so, as you learned last week, you can invert it via the matrix inversion lemma. Okay, so you can compute its inverse, P inverse here, very quickly — last week we learned that's O(n l squared), so even linear time, and we can afford quadratic time per iteration. That's also the preconditioner that we used up here. Yeah, it's true, you could construct an even better preconditioner by selecting better data points, very true — but it turns out that even a very dumb strategy, just randomly sampling, really accelerates the method. And the issue is, if I just ran CG without preconditioning, it wouldn't converge as fast, because of the convergence statement, okay? So somehow CG is a really good method, but we have to warm-start it a little bit to get that benefit — that's the intuition. Okay, so now let's try that out. Here's a large-scale linear solve, and I promised you in the very beginning that I'm going to reduce the 83 hours for solving a large-scale linear system with 100,000 data points to about seven minutes. Here's such a data set: a matrix of size 100,000. This is a kernel matrix, a Matérn-3/2 kernel. For the preconditioner, I run the partial Cholesky for 20 iterations; that takes about one and a half minutes on my machine upstairs. And then I run preconditioned CG, and what you see is that the residual converges to about 10 to the minus nine in 60 iterations, okay? So actually, we could have stopped even earlier, maybe at about three minutes, and would have gotten a very accurate solution. Now, one quick side note: I'm not showing the error in A-norm here, because I would have to compute the exact solution, and I can't do that — the system is too large, right? So I can only show you the residual norm, but that's the proxy we use to track how accurate we are. And the residual norm is almost the same thing as the error in A-norm: it's the error in A-transpose-A-norm, okay? So you can see that I can now compute a GP posterior for a 100,000-data-point data set in maybe seven minutes on my desktop machine upstairs. Now you might ask: well, I have to do more than one solve, right? Maybe I have to do multiple solves during hyperparameter optimization. Actually, you might be able to reuse preconditioners for this — and if you have multiple systems to solve, this cost also amortizes, because you only have to compute the preconditioner once.
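Here's a sketch of both ingredients — applying the inverse of the low-rank-plus-diagonal preconditioner via the matrix inversion lemma, and the CG loop with the warped residual. The structure is standard preconditioned CG; the names and interfaces are my own:

```python
import numpy as np

def apply_Pinv(L, sigma2, v):
    """Apply P^{-1} = (L L^T + sigma^2 I)^{-1} to v via the matrix
    inversion lemma. L is the n-by-l partial Cholesky factor, so this
    costs O(n l^2) instead of O(n^3)."""
    inner = sigma2 * np.eye(L.shape[1]) + L.T @ L       # small l-by-l system
    return (v - L @ np.linalg.solve(inner, L.T @ v)) / sigma2

def preconditioned_cg(A, b, Pinv, tol=1e-9, max_iter=1000):
    """CG where the steepest-descent direction is warped by P^{-1}."""
    x = np.zeros_like(b)
    r = b - A @ x                      # residual (negative gradient)
    z = Pinv(r)                        # preconditioned residual
    d = z.copy()                       # first search direction
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        Ad = A @ d
        alpha = (r @ z) / (d @ Ad)     # exact step length along d
        x += alpha * d
        r_new = r - alpha * Ad
        z_new = Pinv(r_new)
        beta = (r_new @ z_new) / (r @ z)   # conjugacy correction
        d = z_new + beta * d
        r, z = r_new, z_new
    return x

# usage sketch:
# Pinv = lambda v: apply_Pinv(L, sigma2, v)
# x = preconditioned_cg(K + sigma2 * np.eye(n), y, Pinv)
```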
Okay, but what about hyperparameter optimization? I've only told you how to solve linear systems; I haven't told you anything about how we actually do hyperparameter optimization. For the Cholesky, Marvin told you that you can just compute the log determinant from the diagonal of the lower triangular factor. How do we do this for CG? Remember, we want to find hyperparameters theta of the kernel, like the length scale, by optimizing the log marginal likelihood — that's our loss, or objective function. So we need to evaluate the loss, to check how good we are, and most importantly, we need to compute its gradient, so we can optimize with a gradient-based optimizer. So, it turns out: the gradient of the log marginal likelihood — remember, it has a fit term that involves a linear system solve, and a log determinant — if I take the gradient with respect to the hyperparameters, I get this term for the fit, so that's again two linear system solves with the data, and a trace term here as the derivative of the log determinant of the kernel matrix, which also involves a linear system solve. So we have two costly operations for this hyperparameter optimization. We need to solve linear systems, K hat inverse times a vector — most of the time the data minus our prior mean, and in this case also a linear system solve with the derivative of the kernel matrix. But we've solved this: we can use CG, which seems to be really fast. We have to precondition, fine, but we can scale that up to 100,000 data points, as you just saw. But what do we do about these trace terms? Arguably, we could say that we can get away with not evaluating the log marginal likelihood itself and just terminate when our gradients are vanishing — maybe that's fine — but the gradient still involves the trace of a linear system solve. Now, how do we compute these traces? Let's take a step back and look at the definition. The trace of a matrix is the sum of its diagonal, so the sum over the diagonal elements A_ii. One way I could write this, using matrix-vector multiplication, is to multiply from the left and right with unit vectors that select the diagonal elements. It's actually also equivalent to the sum of the eigenvalues, but the first form is the one we want to look at for now. But our problem is: we can't afford cubic computation. We can't afford n matrix-vector multiplies, because that would be n times n squared — cubic. We want to ideally approximate this thing to high accuracy with l matrix-vector multiplies, l much smaller than n. Because if we can do that, we can solve the linear systems with a small number of iterations i, and we can compute traces with a small number of iterations l, and if i plus l is much smaller than n, then we've found a method that approximates Gaussian process regression in quadratic time on very large-scale data sets, as we just saw. So, let's make an observation about this trace of a matrix. Say we had an orthogonal matrix Z, so that Z times Z transpose is just the identity matrix. Then we can rewrite the trace of A as the trace of A times the identity matrix — just introducing a one. And by trace rotation, I can pull this matrix to the front, and I get the trace of Z transpose A Z. And if I use the definition of the trace as selecting the diagonal elements, I get terms that are z_i transpose A z_i. And that's actually also the proof of the trace being the sum of the eigenvalues, because if you choose orthonormal eigenvectors for this, you just get back the eigenvalues, right?
Because A z_i, if z_i is an eigenvector, is lambda_i times the eigenvector, and z_i transpose z_i is just one, because it's an orthonormal system, okay? But that still doesn't get us all the way there, because that's actually still n matrix-vector multiplications, okay? Crap. Well, what do we do if we don't know what to do? Why don't we just randomize? Why don't we draw a small number of random vectors with a certain distribution — in this case, zero mean, and a covariance such that, if I scale the vector with the square root of the dimension, I get the identity covariance? Maybe I can hope that that's actually a good approximation of this thing up here. Let's see whether that works. For the trace of A, I can again introduce an identity matrix. Now, I posit that the covariance of square root of n times z_i equals the identity. Then this square root of n gets squared and pulled out — the trace is linear — so I get an n up here and the trace of A times the covariance of z_i. But the covariance of z_i is nothing but the expected value of its square — for vectors, that's z_i times z_i transpose — minus the square of the expectation; and the expectation is zero, so I just get the first term here. And now I can pull in my matrix A, since the expectation is linear, so I get A times z_i z_i transpose. I can swap the expectation and the trace, because they're both linear, and I can do trace rotation to pull the z_i transpose to the front, so I now get the trace of a scalar, and of that the expectation. The trace of a scalar is just the scalar itself, so I have n times the expectation of this quadratic form, z_i transpose A z_i, and now I can approximate this expectation with Monte Carlo, so I only draw l random vectors to get an approximation of this trace. That's the idea — the intuition you can take from up here is: if I just choose a bunch of z_i, maybe that's already a good approximation to the trace. And that's the idea of this randomized approximation.
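A minimal sketch of that stochastic trace estimator — here in the common Rademacher form, where E[z zᵀ] = I directly, which absorbs the square-root-of-n scaling from the slides:

```python
import numpy as np

def hutchinson_trace(matvec, n, num_vectors, seed=0):
    """Estimate tr(A) as the average of z^T (A z) over random vectors z
    with E[z z^T] = I; here z has i.i.d. +/-1 (Rademacher) entries.
    Only needs matrix-vector products with A."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_vectors):
        z = rng.choice([-1.0, 1.0], size=n)
        total += z @ matvec(z)
    return total / num_vectors

# usage sketch, for an explicit matrix A:
# estimate = hutchinson_trace(lambda v: A @ v, n=A.shape[0], num_vectors=50)
```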
So now I've found a way to also reduce this problem of trace estimation to matrix-vector multiplication, which is the operation that I know is quadratic time and that I can do really fast on a computer — and hopefully I only need a small number of these random vectors. But it turns out I don't get away with a small number of random vectors. Here is a plot that shows the number of random vectors on the x-axis for the estimation of the log determinant. Remember, last time you saw that you can rewrite the log determinant as a trace — I'll show you again, one second: here, the log determinant of the kernel matrix can be rewritten as the trace of the matrix logarithm of the kernel matrix. So if I could evaluate the logarithm of the kernel matrix, I could just apply the trace estimation strategy that we just outlined. Let's assume we can compute this function of the kernel matrix — for us it's always the log, since we care about the log determinant — and not worry so much that we'd also have to approximate the matrix logarithm in the middle; let's just say we can do that. And what you'll find, if you look at the number of random vectors that I take for my Monte Carlo estimate, is that the estimate of the log determinant that I get through the stochastic trace estimate actually has a lot of variance. In the beginning, you can see that I'm qualitatively pretty far away from the true value, in gray dashed, that I want to estimate. I literally need tens of thousands of random vectors — and this is a problem of size 10,000. So I've found a strategy to solve my problem that is just as expensive as the Cholesky, which I can't do on large data sets. Crap. What do I do? Well, this is a property of Monte Carlo estimators: they converge at this rather slow speed, one over the square root of the number of random vectors, or random samples. This is going to be an issue, because it really slows down my hyperparameter optimization, and I simply can't afford that. So what can you do? You can again use something that we already know. We have a preconditioner for our linear system solves, so we know approximately what the kernel matrix looks like — and its inverse. Can we use that to solve this issue here? You can: if you look at the log determinant, you can introduce an identity matrix by inserting the preconditioner — you can think of a partial Cholesky here, if you want — and its inverse. And then, using the rules of the matrix logarithm, I can pull out the log determinant of the preconditioner, and we assumed that was known. So there is a part of this log determinant of K hat that we already know. And then I get a residual here again, which is something like the trace of the difference between the matrix logarithms. And you can already see that if the preconditioner P hat gets close to K hat, this term becomes very insignificant, and the log determinant of the preconditioner is very close to the true log determinant. So the strategy is: use the part that we know, our prior information; pull it out; and estimate the trace that remains via the stochastic trace estimate.
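In formulas, the split is:

```latex
\log\det \hat K
  \;=\; \log\det \hat P \;+\; \operatorname{tr}\!\big( \log\!\big( \hat P^{-1} \hat K \big) \big),
```

where only the trace term — which vanishes as P hat approaches K hat — is estimated stochastically.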
So with the strategies we just walked through, you can train within about an hour on a data set with roughly 250,000 to 280,000 data points, and within up to a day on a million data points, if you have GPUs available, okay? And internally that works exactly how I just showed you. Now, just to explain what pre-training versus no pre-training means here: pre-training just means subsampling a small data set from the data set you're training on, finding hyperparameters that are approximately good on that subsample, and then starting the full optimization from there; I'll put a tiny sketch of that idea at the end of this recap. Yeah? (A question from the audience.) So the question was: can I explain again why we get this improvement in the rate, essentially this part here? Clearly, if the preconditioner is a low-rank preconditioner, this depends on the rank of the low-rank part. If the preconditioner gets close to K hat, this residual goes to zero, right? But to actually figure out how good our estimate is, it turns out this contribution depends on how quickly the preconditioner converges to the kernel matrix. You could imagine needing a lot of pre-computation to get a good approximation of the kernel matrix; that would not help you very much, because it's very costly. But if the preconditioner converges very, very quickly to the kernel matrix, you can invest a very small amount of pre-computation and still get this variance reduction, this quicker convergence, okay? And typically you do know something about how fast your preconditioner converges, and you inherit that rate for the Monte Carlo estimate as well. So instead of one over the square root of the number of random vectors, you converge with a speed-up that's polynomial, sometimes even exponential, which means you converge to the true value much, much faster. Okay, so now you know how you can scale Gaussian processes to large data sets with iterative approaches. Let's do a quick recap of our observations before we move to the final part. You've seen that, in the language of iterative solvers, these methods choose things called actions, which are vectors that essentially determine how to observe my matrix, in our case the kernel matrix. This choice of how to observe the kernel matrix through matrix-vector multiplies really affects how fast we converge. You saw that because CG, which uses a different strategy, namely residual vectors as actions, converges much, much faster, in just a few steps relative to the problem size: the problem size was 100,000 in the example I showed, and we needed about 60 to 80 iterations. Choosing these actions can be interpreted as the solver choosing for itself what to observe about the numerical problem, in this case the linear system. And we know that if we put in more information about our problem, if we already know an approximate inverse of the matrix through a preconditioner, we can accelerate these methods, either for the linear system solve, as in CG, or for this preconditioned log determinant estimation through stochastic trace estimation. So in conclusion, to develop fast numerical algorithms for the Gaussian process case, you really need some sort of domain expertise, in the sense that these solvers need to know something about the problem they're solving, ideally through the preconditioner or through the choice of actions, in order to converge quickly. If you naively do a partial Cholesky, you saw that the result didn't really look like the GP posterior when you just selected the data points in order. So the algorithms themselves need to know something about the problem to converge fast, but we can tell them how to collect that information to warm-start themselves.
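And here is the promised sketch of the pre-training idea. It's a hypothetical helper: fit_hyperparameters stands in for whatever exact-GP training routine you already have, say Cholesky-based marginal likelihood optimization, which is affordable on the small subsample.

```python
import numpy as np

def pretrain(X, y, fit_hyperparameters, subset_size=2000, seed=0):
    """Warm-start trick: fit kernel hyperparameters on a random subsample.

    fit_hyperparameters(X_sub, y_sub) is assumed to return hyperparameters;
    its result initializes the full-scale iterative training run.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(subset_size, len(X)), replace=False)
    return fit_hyperparameters(X[idx], y[idx])
```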
Can we do this even faster? I said we now have a method that uses a small number of quadratic operations, matrix-vector multiplies, to scale to large data sets. But can we go faster still, to linear time in the size of the data set? So that our inference method scales not with the problem size to the power of two, but just with the problem size itself. This idea is actually a little bit older than the iterative-methods approach, not this specific version, but the general idea of linear-time GP inference, and I'll show you the most modern form of it. Again, this is just meant as a quick overview; we won't go too deep into it. One observation you can make about almost any data set, especially low-dimensional ones, is that the data points in it are often quite similar, often quite redundant. You get training examples that are similar to ones you've seen previously. So wouldn't it be great if we could compress the data into a smaller training data set? These points are called inducing inputs, because they're induced from the training data, and they summarize the training data in a way. If I have a million data points, what if I could find a hundred or a thousand points that really summarize the data in my data set? The idea is that we can do this through something called a variational approximation: we directly approximate the posterior, instead of using numerical algorithms to compute the mathematical operations in the exact posterior. So this is changing the model of your Gaussian process. You're not actually doing GP regression anymore and approximating it numerically; you're using a different model and getting it as close as possible to the Gaussian process posterior. One way to do this is called stochastic variational Gaussian processes. What they do is posit a different Gaussian process that depends on Z, these inducing inputs, the summary of the training data. And the posterior here looks very similar to the one you know for GPs, right? Let's ignore this correction term for a moment. Then this part here looks like GP regression without noise on the inducing inputs, because we have the kernel function evaluated at the new data points, and then here the kernel matrix evaluated at all the inducing inputs, inverted, times something. And for the kernel, we have the prior kernel minus an update term that again involves the inverse of the kernel matrix evaluated at the inducing inputs. So if we had no noise, this would look almost like standard GP regression, just on the inducing inputs. Now it turns out that for this to be smart, you actually need a correction term, and you can already guess why: the observation noise appears nowhere in what I just described, right? So how does it enter? It turns out it enters here. There are two more parameters of this term: one is the vector mu here, and then this correction term with Sigma adds more uncertainty.
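Written out in code, those predictive equations look roughly like this. This is a NumPy sketch in my own notation, where kernel, Z, mu, and Sigma are placeholder names for the kernel function, the inducing inputs, and the two variational parameters.

```python
import numpy as np

def svgp_predict(kernel, X_new, Z, mu, Sigma):
    """Sketch of the SVGP predictive distribution.

    Z: (I, d) inducing inputs, mu: (I,) variational mean,
    Sigma: (I, I) variational covariance.
    Only the I x I matrix K_ZZ is ever inverted.
    """
    K_ZZ = kernel(Z, Z)
    K_xZ = kernel(X_new, Z)
    K_xx = kernel(X_new, X_new)
    A = np.linalg.solve(K_ZZ, K_xZ.T).T  # K_xZ @ K_ZZ^{-1}
    mean = A @ mu                        # looks like noise-free GP regression on Z
    # Prior covariance, minus the usual GP update at the inducing inputs,
    # plus a correction that re-injects the variational uncertainty Sigma.
    cov = K_xx - A @ K_xZ.T + A @ Sigma @ A.T
    return mean, cov
```

You can check that with Sigma set to zero and mu playing the role of noise-free observations at Z, this collapses to exact GP regression on the inducing inputs, which is exactly the resemblance I just pointed out.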
So I have three things I can modify to get a good approximation to my Gaussian process posterior: the summary of the training data Z, this vector mu that plays a role like the observations, the label vector, and a term Sigma that adds additional uncertainty. And what we can do is evaluate the KL divergence, a statistical distance between the approximation and the true GP posterior, and use it to optimize the summary of the training data, the vector that represents our observations at Z, and this additional uncertainty term that corrects for the fact that we're only looking at a small summary data set. And we can do all of this in O of I squared times N. Let's take a look at how that comes about. In black, you can see that we have to invert the kernel matrix at the inducing inputs. We only have I inducing inputs, a small number, maybe a thousand instead of a hundred thousand data points, so this inverse is really cheap, because the matrix is only of size I by I. The other terms in there just evaluate the kernel between the I inducing inputs, and during training between the inducing inputs and the training data. So all the operations we need are either I cubed, or I squared times N, where N is the size of the training data, with the one exception being evaluating the kernel matrix at the entire training data set, but that doesn't come up during the optimization. So the idea is: summarize the training data and find a direct approximation to the posterior, and that direct approximation only takes O of I squared N. Great, right? Should we forget about iterative methods now? There are different ways you can do this optimization; the reason for the name "stochastic" is that you typically do something like mini-batch optimization, SGD, so you can optimize this quickly with your favorite optimizer. Asymptotically, the most expensive operation is that O of I squared N, and you multiply by the number of optimization steps you take, but in the training data set size the cost is linear in N, okay? Remember that for iterative methods we also have to do hyperparameter optimization, so we have to do multiple solves at quadratic complexity, one for each hyperparameter optimization step, though the number of such steps is typically way, way smaller than the problem size. Okay, so great, right? Now I've spent maybe an hour telling you about iterative methods, and here I am telling you that you shouldn't use them? Well, here is what happens, for example, if you use stochastic variational Gaussian processes. In blue is the exact GP on this small data set, chosen small so that we can visualize it. The black x's are the observations, our training data. In light blue is the uncertainty of the exact GP, and the gray lines here are the inducing inputs: this is where we initialize our summary of the training data set. And during optimization, as I said, we optimize not only the parameters mu and Sigma but also the inducing inputs themselves. You can see, as I optimize, starting from here, that the inducing inputs find locations that summarize the data set well.
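Since GPyTorch came up earlier: in practice, a stochastic variational GP looks roughly like the following sketch, using GPyTorch's standard variational API. Take the details as indicative rather than authoritative, and note that the data here is synthetic, just to make the sketch self-contained.

```python
import torch
import gpytorch

class SVGPModel(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        # The variational distribution holds mu and Sigma from above.
        var_dist = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0))
        # learn_inducing_locations=True means Z is optimized as well.
        strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, var_dist, learn_inducing_locations=True)
        super().__init__(strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

# Synthetic data, standing in for a large training set.
N = 10_000
X = torch.rand(N, 1)
y = torch.sin(20 * X).squeeze(-1) + 0.1 * torch.randn(N)
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=256, shuffle=True)

model = SVGPModel(inducing_points=torch.linspace(0, 1, 100).unsqueeze(-1))
likelihood = gpytorch.likelihoods.GaussianLikelihood()
# Maximizing the ELBO is equivalent to minimizing the KL divergence
# between the variational approximation and the true posterior.
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=N)
opt = torch.optim.Adam(
    list(model.parameters()) + list(likelihood.parameters()), lr=0.01)
for x_batch, y_batch in train_loader:  # one epoch of mini-batch steps shown
    opt.zero_grad()
    loss = -mll(model(x_batch), y_batch)
    loss.backward()
    opt.step()
```

Each step touches only a mini-batch and the I inducing points, which is where the "stochastic" and the linear-in-N cost come from.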
But there's one issue. If you look at the uncertainty quantification away from the inducing inputs, it's very, very different from the exact Gaussian process. And even where I do have inducing inputs, my posterior mean doesn't necessarily approximate the exact posterior mean well. Now, for the iterative methods I haven't shown you such plots, but there too one can quantify how close the approximation is to the true solution. So the question is: is this a problem for these linear-time methods, and how much of a problem is it for iterative methods? The one thing we have going for us with iterative methods is that we can control the accuracy of our approximation online: we can just keep running the method until we have no more time, then stop, and that determines how accurate our GP approximation is. For the inducing point method, you typically have to set the number of inducing points a priori for your data set and then optimize them. You can run the optimizer for however long you want, but the number of inducing points stays the same, so you have to fix the fidelity of your approximation before you start optimizing. But what if this is just way faster, right? We saw that the theoretical complexity is much better, and we now see that it has an issue with uncertainty quantification. So the question I leave you with today is: can we design a method whose uncertainty quantification we can trust at any point during the approximation, no matter how much computation we've done? Whenever we stop, can we trust the uncertainty? Here, clearly, we cannot trust the uncertainty quantification of our method; it's far away from the exact GP, even assuming our model is correct. With that, let me summarize. We've seen that scaling Gaussian processes to large data sets clearly requires approximation; there are memory and time complexity costs associated with exactness. Iterative methods like the partial Cholesky and CG enable us to approximate the posterior and do hyperparameter optimization in quadratic time in the size of the data set. There's also an interpretation of these methods as active learning algorithms, and we'll make that precise next time. Preconditioning is a way to accelerate these methods, and it can be interpreted as a form of prior information: something I already know about the problem that I put in to make my method faster, to make it learn faster. And finally, sparse GP approximations enable inference in linear time, so much faster, but there may be an issue with uncertainty quantification, and for GPs that's really the entire point: we use Gaussian processes because we want to know how much we should trust our predictions. Okay, so next week I'll make all of that precise. In what way are these iterative methods actually learning machines? In what way are the tools we use to approximate learning algorithms, like the Gaussian process, learning algorithms themselves? I won't only use language that alludes to that; I'll actually show you exactly how. And with that, I'll close five minutes early today, which is perfect for you to fill out the feedback form; I very much appreciate that. And if you have any questions, raise your hand, I'm happy to answer anything to the best of my ability. Thank you.