All right, then let's get started. I'm obviously not Professor Hennig. My name's Marvin; I'm one of the PhD students in Professor Hennig's group, mostly researching probabilistic numerical methods for partial differential equations. And I'm glad to be able to give you a lecture on the role of linear algebra in GPs.

After the last lecture, the feedback sounded a bit frightened. Don't worry, guys. I know last week's lecture was a lot; when I first heard all of this stuff, it was a lot for me too. You get used to it. Those of you who are actually interested in this sort of stuff, please keep pushing and look into these topics — super interesting stuff. If you're not, that's also completely fine. You're going to be good probabilistic machine learning engineers without knowing all of it; it's mostly so you know where to look, essentially.

Before starting with completely new stuff, let's have a few slides recapping what happened last week. Firstly, we saw that Gaussian processes, as suspected over the course of the previous lectures, are indeed probability distributions on function spaces — on functions. We've seen that these spaces need some care when constructing them, but so far so good. We've seen that these probability measures on function spaces have what's called a covariance function, and that the covariance function is actually a kernel in the sense that it's a positive definite function. And the other way around: every kernel that you might see in, for example, statistical machine learning actually defines a Gaussian process with that kernel as its covariance function. We've seen that kernels have eigenfunctions, just like matrices have eigenvectors, and in this sense you can think of them as a kind of infinite matrix that spans a function space. That function space is called the reproducing kernel Hilbert space, which you're going to hear a lot about in statistical machine learning too, I bet. And the relationship of that space to the Gaussian process is that it's essentially the space of all possible posterior mean functions of a given Gaussian process. We've also seen how the two frameworks relate: the posterior covariance function is related to a worst-case error bound that people in statistical machine learning derive in the frequentist framework, which Bayesians, however, treat as an average-case squared error. So in a sense, we're a bit more conservative.

And lastly — this was maybe a bit rushed in last week's lecture — we've seen that samples from a Gaussian process, so functions drawn from that probability measure, actually do not lie in the RKHS of the kernel. There were two very good questions about that in the feedback. One was: why does it even matter? Why would I care whether the samples lie in the RKHS or not? And the other: since these sample functions concentrate around the (posterior) mean function of the Gaussian process, aren't they actually in the RKHS anyway?
But maybe let's start with the second question. Just because two functions are close in value everywhere does not mean that they have to be close in norm — in any given norm, rather — and a norm is what we use to measure distance, or similarity if you will, between two functions. Maybe to illustrate this, let me draw a quick picture. Assume we have a constant mean function of a Gaussian process — some axes here, say this is zero — and call it f. And then we have a second function g that is just offset by a value of, like, one. The point-wise distance between these two functions is then always one — or smaller than or equal to one, rather — for all points in the domain.

Now, that is point-wise distance. But there are norms that are supposed to measure the length of a function, and hence the distance between two functions, which don't just compare function values. Take the supremum norm, something like the supremum over all x in the domain of the absolute value of the function. In this norm the two functions would also have a distance of one, right? Because the largest point-wise deviation is one. But there are norms that combine the deviation in value and the deviation in the derivative, which makes sense, for example, if you want to measure smoothness — say you want to encourage smoothness in your solution via a regularizer. So then we essentially have the derivative term added on top. And now I can construct a function that breaks this: assume f is zero and superimpose a Gaussian bump. That bump has a parameter, its standard deviation. I'm omitting the normalization constant, so at its maximum the bump always has value one — that's fine. But if I now decrease the standard deviation, the derivative near the bump becomes essentially unbounded; I can make it as steep as I want, which means this derivative-aware norm can get much larger than one. So we have two functions which are at most point-wise distance one from each other, but which in some other norm are essentially arbitrarily far from one another. And this is what's happening here: even though the samples are quite close to the mean function point-wise, they're not necessarily close in the norm of the sample space, if there is such a thing. That's the answer to the second question.

And then, for the first question: I said these samples *can* lie outside of the RKHS, because there is one very important case in which they are actually inside the RKHS, namely when the RKHS is finite-dimensional. For the feature construction of a Gaussian process, where you explicitly give a finite set of features, you can quite easily see that the samples lie in the space spanned by the — parametric, if you will — kernel.
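To make that norm picture concrete, here's a small numerical sketch (my own illustration, not from the lecture materials): the Gaussian bump stays within point-wise distance one of the zero function, but a C¹-type norm of the difference blows up as the bump gets narrower.

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 200_001)   # fine grid on the domain
dx = xs[1] - xs[0]

for sigma in [1e-1, 1e-2, 1e-3]:
    g = np.exp(-xs**2 / (2 * sigma**2))       # unnormalized Gaussian bump, max value 1
    sup_dist = np.abs(g).max()                # sup-norm distance to f = 0: stays at 1
    c1_dist = sup_dist + np.abs(np.gradient(g, dx)).max()  # sup norm + derivative sup norm
    print(f"sigma={sigma:g}:  sup-norm dist = {sup_dist:.2f},  C1-type dist = {c1_dist:.1f}")
```

The sup-norm distance is 1 for every sigma, while the C¹-type distance grows like 1/sigma. Okay — back to the finite- versus infinite-dimensional story.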
But as soon as the RKHS is infinite-dimensional, samples will not lie in the RKHS with probability one — or, phrased the other way around, the probability that a sample path from a GP is an element of the RKHS is zero. Now again, why should you care? The answer, which we actually didn't show yet, has maybe two parts. First of all, there are real smoothness differences between the samples and the RKHS itself, in the sense that for some kernels, functions in the RKHS are more often differentiable than the sample paths are. And you might care about that if you, for example, want to use Gaussian processes for Bayesian optimization, where you want to be able to differentiate the process because you have derivative observations, say. If you want to work with derivatives of a Gaussian process, it needs to be differentiable — there's a good reason to care about the sample space. The other thing is: we're Bayesians, right? The object we actually care about is not the posterior mean; the posterior mean just summarizes the main object of interest, namely the posterior distribution. And if the measure we're positing does not give mass to the space in which the function we want to learn lies, that doesn't really make sense in a Bayesian framework, does it? So as Bayesians, I would argue, we should care more about the sample space than just about the RKHS of the kernel. And let's get to something more practical, looking at some applications where that might be worthwhile.

So in general, it's pretty difficult to actually talk about — there's a question. The question is, I guess, how you measure "close". That comes back to the point-wise deviation versus norm-wise deviation, maybe. The other thing is that the RKHS is, in a sense, small compared to the sample space. If you think about finite-dimensional Gaussian distributions: a linear subspace that does not have full dimensionality in a given vector space has measure zero under a Gaussian measure, right? So you can kind of think of the RKHS as being "lower-dimensional". That's quite difficult to quantify, though, because we're working with infinite-dimensional spaces — it might seem like one infinity should simply be larger than the other, and it's not quite that simple, but it's an intuition that helps here.

Can you speak up a bit? Sorry. — Yeah, but it's not as simple as saying the kernel needs to be differentiable or something; it's a bit more than that. It obviously comes back to a property of the kernel, because the kernel uniquely identifies a Gaussian process, so it kind of has to reduce to that; there are theorems about this. Maybe let's take this offline and talk about it later.

Okay, so it's a bit difficult to talk about the sample space of a GP. If there's one that all GPs share, then — and you saw this last week via the Kolmogorov extension theorem — it's the space of all functions from X to R, which is a bit useless because it's very, very ill-behaved and not really useful in practice.
What one typically does instead is find an intermediate space that lies, so to speak, around the RKHS and contains all samples from the GP as a set. It's then helpful to associate additional structure with it — typically we talk about a Banach space, essentially a normed space that is complete with respect to its norm; let's not worry about that detail too much. So in a sense we're positing additional structure because it's mathematically convenient — that's one argument. The other is that in applications we often *know* functions to lie in a certain Banach space, which is also quite useful. This is quite a practical thing, actually: if you know something about your function — for example, that it needs to be at least once differentiable — then it makes sense to place your measure on a function space that only contains functions that are sufficiently often differentiable, because you can learn faster, right? If you can rule out non-differentiable hypotheses from the start, there are fewer hypotheses left to rule out with data, which means you handle the data you have more efficiently, because you've put more prior knowledge into the problem. And sometimes, because certain transformations are only defined on certain subsets of functions — for example the differentiable functions — you actually need to do this for the differentiation of a Gaussian process to make sense. Integration is kind of the same argument.

Some examples. I already told you that R^X can be considered a sample space of any GP. It's too large for practical use, because you don't really learn anything if you only assume your function lies in that space. Good luck. The next, maybe smaller, space that is still quite ubiquitous and useful: samples from the standard non-parametric kernels you've seen — square exponential, Wiener process, all of these — are typically at least continuous, so you can take the Banach space of continuous functions, which is about the weakest assumption you typically want to make. The Banach space of k-times continuously differentiable functions is useful, as I said, if you work in Bayesian optimization, where you sometimes have derivative observations; there k is typically set to one. — Question? — A Banach space: you have a norm on the space, so it's a vector space with a norm, and it's complete with respect to that norm, meaning all Cauchy sequences converge in the space. That's not super important here; you can look it up. Yeah.

So there are theorems that tell you which properties of a kernel imply, for example, that the samples are going to be continuous. The keyword there is the Kolmogorov continuity theorem, and there are extensions of it which tell you whether samples are continuously differentiable — again, just from properties of the covariance function. And as soon as you know that your samples are continuously differentiable, you can think of the sample space as being that Banach space. If you want to read up on that, my paper actually has quite an extensive treatment of this — for example, of which kernels have smooth samples.
There are papers that show, for example, that the sample space of the RBF kernel is any C^k space, and in practice you pick the one that's interesting to you: if you need 20 derivatives, you pick C^20. That's the idea. Then, Sobolev spaces — I'm not sure whether anybody has even heard of a Sobolev space, but these appear quite often in partial differential equations, and if you want to use a GP to infer the solution of a PDE, that's the kind of space we want to look at. And then there is actually a notion of an RKHS that *is* the sample space of a GP — but it's not the RKHS of the kernel. It's a systematic way of enlarging the RKHS of the kernel, called a power of the RKHS. Just to flash this at you: the idea continuously grows the RKHS by introducing, in the Mercer series representation, a power theta on top of the square root of the eigenvalue, which makes the space larger. Just to show you that this exists.

But now, unless there are any more questions about this — yeah. Well, the reasoning is a bit the other way around, kind of: if you know that you're going to observe derivatives of degree 20, then you have to find a kernel whose sample paths actually lie in that space. That's the story. You just know, for some kernels, that they produce such samples. It's a bit of circular-sounding reasoning, but it's important to think about, because otherwise the operation of differentiating the process is obviously going to be ill-defined, right? You have to think about this at the level of the sample space for it to make probabilistic sense. There's a bit more to that story, and there are references for it in the lecture materials if you want to read up on it. But it's mostly: if you want to differentiate the functions over which you hold a probabilistic belief, then the functions your belief is over had better be differentiable, if that makes sense.

Maybe one quick question — what do you mean? — That's kind of the idea. This is that informal type of argument: if you remember from last lecture, if you sum up the squared coefficients of a sample coming from the Karhunen-Loève expansion, measured in the RKHS norm, that sum is going to be infinite. If you reweight by a certain power, it might not be anymore. That's the idea, but there's more rigor that needs to be taken care of there, right?

All right, so maybe let's come to something a bit more practical. We've kept saying that GP regression is, in a sense, relatively expensive — polynomially so, but still. So what do we do? We have quite an expensive method. First of all, it's interesting to look at *why* it's expensive: what is the actual operation that costs us this much? And then we can think about some solutions. That's what we're going to do today, by zooming in from a very, very high level into our recurring implementation of a Gaussian process — this abstraction Professor Hennig has been showing you — and localizing where things go, not necessarily wrong, but expensive.
First of all, the problem we want to solve: we want to learn a function from a given finite set of input-output pairs, which is what we typically call a supervised learning — specifically, regression — problem. And in this lecture we're working on it with a Gaussian process model. So we posit a Gaussian process prior over that function f, with some given mean and covariance function; we model the data set as i.i.d. observations from that generative model; and we use Bayesian inference to get a posterior belief — a posterior probability distribution — over the unknown function f.

In code, the level at which we interface with this technique looks like this: we define a mean function, we define a kernel function — covariance function, if you ask me — and then we define a prior quite generically by just passing it these two functions, which is literally the thing we wrote down in the mathematics above. And then we condition on the data. That's Bayesian inference in a nutshell, in what, nine lines of code. But notice that this code doesn't actually cost anything. Look at it: we're defining functions; we're constructing an object that just stores these functions and never evaluates them; and if you look closely into the implementation of the conditioning function, it just takes these objects and stores them in a new object. It doesn't really *do* anything. So this is not expensive. Where does the expensive stuff happen?

Let's first go back to the mathematics of the Gaussian process to understand better what we're implementing here. The posterior of a Gaussian process under point observations — finitely many observations from this GP — is another GP, with a particular posterior mean function and posterior covariance function. And, in code terms, it still looks quite similar to the mathematics. There's a bit of magic happening, which we'll get into in a second, but notice that this stuff also doesn't cost anything yet, because we still don't evaluate these functions. So again, where's the cost?

It's when we get into the numerics layer. If we actually evaluate the Gaussian process — if we want to predict our posterior belief at a certain set of prediction points — we call the mean and covariance functions, which in turn call these cached properties. I think Professor Hennig already explained what a cached property does, so I don't have to go over that again. But notice, first of all — and I'm not sure if you already talked about this — here's where the memory cost comes in. The first function evaluates the kernel function on all pairs of data points, thus constructing the kernel matrix, which costs n² memory. And this is arguably the worst thing about a GP. Quadratic memory cost is bad, right? You can let an algorithm run longer, but it's difficult to keep plugging RAM sticks into the machine you're working on — this will just eat up all your RAM.
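Just to make the "nothing happens until you evaluate" point concrete, here's a minimal sketch of that kind of deferred structure — with made-up names, not the actual course code:

```python
import numpy as np

class GaussianProcess:
    """Minimal lazy GP: stores a mean and a covariance function, computes nothing."""
    def __init__(self, mean, cov):
        self.mean, self.cov = mean, cov        # just stored, never evaluated here

def condition(prior, X, y, noise_var):
    """'Bayesian inference' that is free: it only closes over the data."""
    def posterior_mean(x):
        K_XX = prior.cov(X[:, None], X[None, :])       # n x n kernel matrix: n^2 memory
        K_xX = prior.cov(x[:, None], X[None, :])       # m x n cross-covariances
        w = np.linalg.solve(K_XX + noise_var * np.eye(len(X)), y - prior.mean(X))
        return prior.mean(x) + K_xX @ w                # the n^3 solve hides in here
    return GaussianProcess(posterior_mean, prior.cov)  # posterior cov omitted for brevity

mean = lambda x: np.zeros_like(x)
kernel = lambda a, b: np.exp(-0.5 * (a - b) ** 2)      # square-exponential kernel
prior = GaussianProcess(mean, kernel)

X = np.linspace(0.0, 1.0, 5)
y = np.sin(6.0 * X)
posterior = condition(prior, X, y, noise_var=1e-2)     # instant: nothing computed yet
print(posterior.mean(np.array([0.3, 0.7])))            # only NOW do we pay
```

Constructing the prior and calling `condition` is instant; the n×n kernel matrix — and with it the quadratic memory and cubic time — only appears in the final line.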
So this is something we need to fix, but it's not the topic of today's lecture — Professor Hennig is going to talk about it, I think, next lecture. The main cost in terms of *computation* is this line, where we compute what's called the Cholesky decomposition of this predictive covariance. Why would we do that? I'm not sure all of you have seen Cholesky decompositions in your undergrad, so we'll briefly go over this now. Maybe hands up: who's heard of Gaussian elimination before? — I kind of expected that, nice. Who's heard of LU or Cholesky decompositions, or dealt with them? — Okay, that's a bit less. Don't worry; I realize this is a bit of a wall of text, but we're going to chew through it quite slowly.

So, Gaussian elimination — most of you have seen this, probably in high school — is a technique for computationally solving a linear system. The main idea: we start with a matrix A (ignore that index for now) and split off its first row; that row we leave as it is. The idea of Gaussian elimination is then to subtract that row from all successive rows in such a way that we get zeros in the first column. How do we do that? We divide the first row by its leading entry α, scale it by the leading entry of the respective successive row, and subtract; then we get a zero there. In vector notation, which is what I just said: we define the vector l, which takes the column vector b — the first column of the lower-left block — and divides it by α; we scale the row vector uᵀ by that; and we subtract the result from the lower-right block B. Why only that block? Because we already know the first column becomes zero, so the only non-trivial computation is in B. Writing it out: we take the matrix with blocks (α, uᵀ; b, B) and transform it to (α, uᵀ; 0, B − l uᵀ), with l = b/α. And now the idea is to repeat this process recursively on the remaining block until we reach what's called row echelon form — since we're assuming the matrix is invertible, this is just going to be an upper triangular matrix in the end. Clear? Questions? All right.

The cool thing now is that we can invert these row operations by writing another matrix L in front, which is also constructed in a recursive way, and it turns out that this matrix L is lower triangular. How do we see this? It's a recursive argument — at least those of you studying computer science know that to understand recursion, you first have to understand recursion, so this is always a bit of a difficult part. But assume, as a recursive argument does, that we can decompose a matrix of dimensions (n−1)×(n−1) — which this remaining block is, if the original matrix is n×n —
into a lower triangular matrix times an upper triangular matrix. The base case is a 1×1 matrix, which fulfills this trivially. So assume we can decompose this block, which we call A_{i+1}, as some lower triangular L_{i+1} times upper triangular U_{i+1}, where U_{i+1} comes from applying the Gaussian elimination procedure all the way down. Then we can plug this decomposition in, factorize again, and we get the equation up there: A_i = L_i U_i, where L_i and U_i are built from the L and U matrices we got from the one-dimension-smaller version, recombining terms. Don't worry if you didn't get all of that — sit down at home, go through the math, actually multiply the matrices out, and you'll figure it out. It's not that difficult; there are just terms flying around all the time, and it's a bit hand-wavy here, maybe.

Now, there might be a problem with this. Who noticed that we did something that might cause problems, at least numerically? — Yes: we're dividing by that leading entry α, and we don't know whether it's positive — or rather, whether it's non-zero. And one can show that if we repeat this process and also swap rows to always pick the largest remaining leading value α — which is called pivoting — then the process always works: we always divide by a non-zero number and we always get the decomposition; this is essentially equivalent to A being invertible.

Computationally, how much does it cost? At least in leading-order terms: computing the outer product in iteration i costs (n−i)² multiplications, because that block is (n−i)×(n−i), and subtracting it from the corresponding part of the matrix is another (n−i)² additions. Summing over iterations, we end up with 2/3·n³ if you actually do the math. So this is where the n³ runtime of "matrix inversion" comes from in the end.

But what does this buy us? We didn't solve a linear system yet — we just wrote our matrix as a product of two matrices, which seems to make things more difficult, if anything. Luckily, because these matrices L and U have structure — again, wall of text, we'll get through it — we can solve with them very cheaply: instead of n³, it costs n², which is the same as the cost of a matrix-vector product, just to put it in context. So once we've computed this decomposition, we can solve a system at the cost of, essentially, two matrix-vector multiplications, which I'll show you in a bit. — Maybe a question? — "To leading order" is not quite big O: for the leading-order term you also want the constant. Big O would discard that; in big O it's all n³, while in leading-order notation it's 2/3·n³ flops — which we're going to see we can reduce. — And yes, we're recursively partitioning the matrix, and at iteration i that block has size n−i; in the first iteration it's n−1, because we split off the first row and column.
And because we're descending into the matrix recursively and have to subtract that outer product in every iteration, we pay that cost at every iteration — that's where it comes from.

Right. So the system Ax = b is now LUx = b, since we assume we have the decomposition. That means we can solve two triangular systems instead: first Ly = b, and then, knowing y, Ux = y, which gives the solution x of the original system. The nice thing is that because L and U are lower and upper triangular, respectively, we can use techniques called forward substitution and backward substitution, which are much cheaper. It works like this — it's another one of these recursive things. Assume we already know the solution in the first n−1 entries of the system; again a recursive argument, so you have to believe me that it's possible for the smaller system. Partition L into a leading block L′, a zero block above, and a last row (lᵀ, λ), and partition y and b accordingly. The argument that follows is only possible because of that zero block: if we know the y′ that fulfills L′y′ = b′, we can multiply out the last row, lᵀ y′ + λγ = β, and solve for the single unknown γ in that row — bring everything to the other side and divide by λ, so γ = (β − lᵀ y′)/λ — and then update our solution to the smaller problem by appending γ. Again, maybe sit down at home and go through the math. If we do this recursively all the way down, the base case is a scalar equation — a scalar times the last unknown equals some scalar — which we solve by division, and then we're done. And this, as I said, costs quadratically in the number of rows and columns of the matrix: what we're doing at iteration i is essentially an inner product between two vectors of length i, and summing that up gives n².

So: once we have an LU decomposition, we can solve k linear systems at the cost of computing the decomposition once, plus k times 2n². And that's much cheaper than computing an inverse matrix — calling numpy.linalg.inv — and then multiplying with it. Because to get the inverse, what numpy essentially does is compute an LU decomposition and solve against all the canonical basis vectors, i.e. the identity matrix, which costs 2/3·n³ plus another 2n³; and then multiplying k vectors with the inverse matrix costs another ~2kn². So you pick up an extra n³-sized term for nothing. That is why people tell you to use a matrix decomposition and solve with it, instead of computing the inverse and multiplying with it — it's much cheaper. And at the same time it should also be more numerically stable.

All right. But we're still not using all the structure our matrix has: we know that our kernel matrix is symmetric and positive definite. And there's a specialized version of the LU decomposition for that case, called the Cholesky decomposition.
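Before we get to Cholesky: here's the whole LU pipeline from the last few slides as a plain-Python sketch (my own illustration; no pivoting, so it assumes the leading entries stay non-zero):

```python
import numpy as np

def lu(A):
    """Outer-product LU, no pivoting: ~2/3 n^3 flops."""
    A = A.copy()
    n = A.shape[0]
    L = np.eye(n)
    for i in range(n):
        L[i+1:, i] = A[i+1:, i] / A[i, i]                   # l = b / alpha
        A[i+1:, i+1:] -= np.outer(L[i+1:, i], A[i, i+1:])   # B - l u^T
        A[i+1:, i] = 0.0
    return L, np.triu(A)                                    # U is what remains on top

def forward_sub(L, b):
    """Solve L y = b for lower-triangular L: ~n^2 flops."""
    y = np.zeros_like(b)
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]          # gamma = (beta - l^T y') / lambda
    return y

def backward_sub(U, y):
    """Solve U x = y for upper-triangular U: ~n^2 flops."""
    x = np.zeros_like(y)
    for i in reversed(range(len(y))):
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)) + 5 * np.eye(5)             # well-conditioned toy matrix
b = rng.standard_normal(5)
L, U = lu(A)                                                # pay 2/3 n^3 once ...
x = backward_sub(U, forward_sub(L, b))                      # ... then 2 n^2 per right-hand side
print(np.allclose(A @ x, b))                                # True
```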
You can convince yourselves that essentially the same argument we just went through for the LU decomposition holds for the Cholesky decomposition, where we now want to keep things symmetric. That means that instead of leaving the factor α in the upper triangle alone, we spread it across both factors, if that makes sense: we take its square root, which is always possible because of positive definiteness of the matrix. The argument is then essentially the same; we're just careful to keep up the symmetry, so that the factors end up being transposes of each other. So we essentially get just one factor to compute, and the upper triangular factor is its transpose. And you can convince yourselves — it's actually not too hard — that the cost is exactly half of what an LU decomposition costs. That's nice: a free factor-of-two speedup.

The interesting thing here: in exact arithmetic, this does not need pivoting — no swapping of rows and columns — and it's guaranteed to always work by positive definiteness. There's actually a theorem saying this decomposition is, in a suitable sense, unique for every such matrix. That said, it's still sometimes a good idea to pivot in practice, and we're going to talk about what pivoting means in the context of a Gaussian process, because there's a very specific interpretation. Maybe somebody already has an idea what that might be? Let's see — maybe one slide later. Pivoting improves the stability of the algorithm, because we have round-off errors: we're operating in finite precision on a computer, so we need to take some care here.

Not a lot of people know that you can also write the Cholesky decomposition as an *iterative* algorithm. Just brushing over this very, very quickly: at iteration i we take the matrix we've been working with and split off the i-th column, dividing it by the square root of its leading entry — the analogue of taking the α and dividing the b by it on the previous slides. That's going to be our l vector; its first entry is now the square root of α, but don't worry about it. Then we compute a rank-one downdate: we subtract from the working matrix the outer product of that l vector with itself, which, if you look at it, cancels out the i-th row and i-th column of the matrix — sets them to zero — which is exactly what we want here. And then we append that l vector as a new column of a growing buffer matrix in which we collect the lower triangular factor. Seen this way, if you were to terminate the algorithm early — something you can do once you think of it as an iterative procedure rather than a recursive one — you're actually building a low-rank approximation to the matrix. And that's quite helpful, because low-rank factors can be solved with much quicker: we can just terminate early and work with that partial factor. This is something Professor Hennig is going to spend a lot of time talking about next week.
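Here's a sketch of that iterative, outer-product view, with the early stop included (again my own illustration; no pivoting here, so where you stop determines how good the low-rank factor is):

```python
import numpy as np

def cholesky_outer(A, rank=None):
    """Outer-product Cholesky: returns L with A ≈ L @ L.T.
    Stopping after `rank` of n steps yields a low-rank approximation."""
    A = A.copy()
    n = A.shape[0]
    r = n if rank is None else rank
    L = np.zeros((n, r))
    for i in range(r):
        L[i:, i] = A[i:, i] / np.sqrt(A[i, i])      # split off column, divide by sqrt(alpha)
        A[i:, i:] -= np.outer(L[i:, i], L[i:, i])   # rank-one downdate: (n-i)^2 work
    return L

# toy SPD kernel matrix
x = np.linspace(0.0, 1.0, 50)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1**2) + 1e-6 * np.eye(50)

L_full = cholesky_outer(K)                 # exact: K = L L^T
L_lowrank = cholesky_outer(K, rank=10)     # early stop: rank-10 approximation
print(np.allclose(K, L_full @ L_full.T))
print(np.linalg.norm(K - L_lowrank @ L_lowrank.T))   # residual of the low-rank factor
```

Stopping after r of n steps costs only O(n²r) and leaves you with an n×r factor.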
So methods like this are actually one of the approaches for speeding Gaussian process inference up to subcubic time complexity. And the graphic shows what I was talking about: in iteration one, A′ is just A, and we subtract the outer product, which zeros out the first column and first row of the matrix; the gray part is the only thing that's not yet zeroed, and the accumulated factor after one step is just the outer product of two vectors — a rank-one approximation. In the next iteration we take the second column of the working matrix, which now has a zero in its first entry, so we keep up the zeros in this L-shaped region and keep shrinking the unexplored region in gray. That's the picture to have in mind. And this is also how you can see that iteration i of this process only takes (n−i)² operations: we didn't write it this way in the algorithm, but you can implement it more efficiently by only operating on that region, because everything else is just zero anyway. All right. And one other thing: the very first step has to touch all the data points — that's going to be important in a bit.

So what we've seen: for the purposes of this lecture, training — learning — a Gaussian process reduces, in terms of computation, to computing a matrix decomposition. Why do we do this? It's essentially a way of amortizing computation. We spend time computing the decomposition once, because afterwards everything is very, very cheap — it's like an acceleration structure. Once we have it, solving with the matrix and doing prediction with the Gaussian process become cheap; and for the posterior covariance you need to solve quite a lot of systems with that matrix, so it's nice that we can speed those up by quite a bit. But it still costs us cubically, once; after that, all operations are quadratic. And while we didn't really talk about stability, the pivoted version of the Cholesky decomposition is a reasonably numerically stable approach to computing this decomposition.

Still: cubic complexity is quite bad, right? Definitely not enough in a big-data setting. But it doesn't follow that this is the price of being Bayesian — we're going to talk about this in the coming slides. There are other reasons why we pay cubically here, and there are ways to speed it up. Maybe we can even work towards something as cheap as deep learning — because deep learning, and I guess more generally regularized empirical risk minimization, is often thought of as being, well, depending on who you ask, O(n), or maybe even O(1) if you use batching and stochastic optimization, since the cost is then independent of the data set size. And actually, classical kernel machines — kernel ridge regression — are also O(n³). Why is that?
Gonna answer this in the next couple of slides. — That was a question? In code, you mean? Or — yeah, let's maybe start in code and then go on to the math; that's maybe a good idea.

So what we're doing here: if we, for example, plot a Gaussian process posterior — I think it's called a conditional Gaussian process object in this code — then at some point the mean and covariance functions are going to be called, right? And here there's this object called the representer weights, and this cho_solve routine. Now, assuming that this predictive covariance's Cholesky object contains some representation of a Cholesky factor, cho_solve does essentially two of these triangular solves in sequence — one forward substitution and one backward substitution, as we've seen on the slides — to compute this, it's up there, this right here. So these two terms; there's one inverse in there. And if we now want to predict at, say, m points — which is going to be this argument here — then we're implicitly solving m linear systems, at cost — let me think — okay, 2mn² here. And then there's another matrix product with the result, which is also ~2mn², because we're contracting over the n dimension. So this is where we use the Cholesky decomposition in the covariance.

And in the mean: what's called the representer weights here is essentially just a way of caching this vector, y − μ_X, multiplied with the inverse Gramian. That product is a vector again, and because it's the same vector for every point at which we predict, we can just cache it, so we don't recompute it for every prediction. We call it representer weights because — did you guys already talk about kernel machines in SML? Okay, so this is what it's called in SML; don't worry about it, it's just a name that comes from kernel ridge regression. — Yes, exactly: because it's the same matrix we're solving against, we get away with one decomposition.

There's one caveat, though: if we want to *sample* from the posterior, we need the Cholesky decomposition of the posterior covariance evaluated at the test points — so we need to compute a decomposition again to sample, right? That's a bit of a bummer. There are ways around this — mostly approximate schemes, I think, that let you get away with just the one factorization, but those depend on the kernel, so that's a more complicated story. Also, specifically for sampling from the posterior — because the posterior covariance might collapse, i.e. have (near-)zero eigenvalues — it's sometimes actually better to use a symmetric matrix square root, which you'd typically compute via an SVD. A bit of a different story that we won't go through here.

But yeah, these are the routines that compute and cache these quantities; here's where the Cholesky factor gets computed, so to speak. — Another question? Sorry — yes. You mean if you have data arriving in an online fashion? Yes, that is going to be this week's exercise sheet, so I'm not going to say too much about it.
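Before sketching that idea: since we just walked through the prediction path, here's what that structure looks like in scipy terms — a sketch with made-up names, not the course's actual implementation:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def fit(X, y, kernel, mean, noise_var):
    """Pay the cubic cost once: factorize and cache the representer weights."""
    K = kernel(X[:, None], X[None, :]) + noise_var * np.eye(len(X))
    chol = cho_factor(K)                     # ~1/3 n^3 flops
    alpha = cho_solve(chol, y - mean(X))     # representer weights: two triangular solves
    return chol, alpha

def predict(x, X, chol, alpha, kernel, mean):
    """Posterior mean and covariance at m points: O(m n^2), no new factorization."""
    K_xX = kernel(x[:, None], X[None, :])    # m x n cross-covariances
    post_mean = mean(x) + K_xX @ alpha       # one matvec against the cached weights
    V = cho_solve(chol, K_xX.T)              # m solves against the same factor: 2 m n^2
    post_cov = kernel(x[:, None], x[None, :]) - K_xX @ V
    return post_mean, post_cov

kernel = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
mean = lambda x: np.zeros_like(x)
X = np.linspace(0.0, 1.0, 100)
y = np.sin(6.0 * X)
chol, alpha = fit(X, y, kernel, mean, noise_var=1e-2)
mu, cov = predict(np.linspace(0.0, 1.0, 7), X, chol, alpha, kernel, mean)
```

The point is the split: `fit` pays the cubic factorization once; every `predict` afterwards is quadratic.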
The idea is that there are versions of the Cholesky decomposition that let you reuse the factor of existing blocks and update it to the factor of a bigger matrix. So if you have a matrix and you append blocks to it, you can update the Cholesky decomposition of the original block at a much cheaper cost than the full n³, where n is now the dimension of the larger matrix. And that is, I guess, the most elegant way to implement online versions of Gaussian process conditioning. Although, to be honest — again, because instantiating a Gaussian process, and in particular a posterior Gaussian process, doesn't cost anything — you only really run into this problem if you condition on a batch of data, then evaluate, and *then* more data arrives; otherwise you can just concatenate the data sets and evaluate afterwards, because instantiating is free. So it depends a bit on the actual scenario. Yeah — and that's how you get to it mathematically, why it works; the actual linear algebra is organized differently, because that's more efficient, by the way.

All right, more questions? — Yeah, but the story is a bit more subtle there, because you can rewrite the trace as different linear combinations of the eigenvalues, and since we're not operating in an eigenbasis here, that correspondence won't hold, in general at least, for the Cholesky decomposition. But I think it's a good mental picture that you're trying to find directions along which the matrix spans this ellipsoid you typically imagine when you think about a matrix. And specifically for LU there are two pivoting moves you can apply: you can permute rows and you can permute columns. If you only permute rows, it's called partial pivoting, and I think the most stable variants do both — complete pivoting — so that you mitigate cancellation issues as far as you can; you do still run into them, but you mitigate them.

That is actually the point I wanted to allude to earlier. Having seen the iterative Cholesky — and you can do the same with LU too, I think it's called the outer-product version of the algorithm; you can early-stop that as well, of course — the question is: in the context of GP regression, what does pivoting mean? And there you really *want* pivoting. Because if your data lies on, say, a linspace — so the ordering of the data points in the matrix corresponds to them being close to one another — you don't necessarily want to walk over the data set in that order. Once you've conditioned on a data point at some location, nearby points teach you less — this depends on the kernel, but let's talk about stationary kernels for now, which have a local effect on the function — so the points farther away from the data you've already seen grant you more information about the unknown function. So it's actually a good idea to have pivoting find an ordering of the data points — that's what I was alluding to with what pivoting means in the context of Gaussian process regression — such that at each step you get the most, informally speaking, information gain from your data points.
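A sketch of that greedy rule (my own illustration): pick, at each step, the point with the largest remaining diagonal entry — which for a GP is the point the current low-rank approximation is most uncertain about.

```python
import numpy as np

def pivoted_cholesky(K, rank):
    """Greedy pivoted Cholesky: returns L (n x rank) with K ≈ L @ L.T,
    plus the chosen pivot order. Each step picks the point with the
    largest remaining variance on the diagonal."""
    K = K.copy()
    n = K.shape[0]
    L = np.zeros((n, rank))
    order = []
    for i in range(rank):
        p = int(np.argmax(np.diag(K)))       # most 'informative' remaining point
        order.append(p)
        L[:, i] = K[:, p] / np.sqrt(K[p, p])
        K -= np.outer(L[:, i], L[:, i])      # rank-one downdate
    return L, order

x = np.linspace(0.0, 1.0, 50)                # data on a linspace
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1**2)
L, order = pivoted_cholesky(K, rank=8)
print(order)                                 # jumps around the domain, not left to right
print(np.linalg.norm(K - L @ L.T))
```

With a stationary kernel on a linspace, the printed order spreads out over the domain instead of walking through neighboring points.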
And that greedy, information-based ordering is, again, something Professor Hennig is going to talk about a lot next week. Oh, sorry — I forgot to repeat the question. It was about pivoting: whether pivoting has something to do with trying to find, or eliminate, the largest eigenvalues first. And the answer is: the pivots — these diagonal entries you're working with — don't necessarily have anything to do with the eigenvalues in this case, because these l vectors aren't in general the eigenvectors. That's why the two are not the same.

All right, let's take a quick break. I'll be around if there are any questions about last week's lecture or anything we've talked about today — drop by; I'm also here after the lecture.

All right. I just got a really good question that I should have mentioned on this slide: why is this O(n), respectively O(1)? Because, well, in each iteration of, say, SGD there's a huge network that needs to be evaluated and differentiated through. But this notation only looks at how the method scales with the *data*. If we take a network whose size is independent of the number of data points and fix it, then it's O(1); otherwise this would be something like O(whatever it costs to evaluate the forward and backward passes of the neural network). Simple as that. — Question? — I don't quite understand the point — well, I mean, a lot of models are actually trained without even doing one full epoch, right? So there's definitely a point to be had there. But we're reflecting current practice here, and that's arguably also where this distinction comes from: one iteration costs you essentially O(1), so k iterations cost you something like O(k), which is still O(1) in the data set size — though arguably it shouldn't be counted that way, and we don't know all that much about how deep learning models train. In a certain sense you'd expect that O(1) to scale with the number of data points; but then again, arguably you could also have looked at just one data point and landed in the same computational cost class. This is why, depending on whom you ask, you get different answers. If we assume we go through the data set once, then it's O(n), right?

Okay. So, to actually show you what I mean when I say "deep learning": I mean regularized empirical risk minimization to find the weights of some neural network, with some per-example loss function lowercase ℓ here, plus some regularizer that depends only on the weights. The loss decomposes into a sum over the training data points. And what you typically do in deep learning — I guess most of you took the deep learning lecture last semester — is use a stochastic first-order optimizer. You're not actually optimizing this loss function directly; you draw B examples from your data set and compute that expectation — and in particular its gradient — on the batch. That means this so-called mini-batch gradient g(w) is actually a random variable. Depending on how you draw the data points from the training set, it's a better or worse — sometimes unbiased, sometimes not — estimator of the true gradient, but it's still quite noisy.
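As a tiny sketch of that estimator (generic names, nothing framework-specific):

```python
import numpy as np

def minibatch_gradient(w, X, y, grad_loss, B, rng):
    """Noisy estimate of the full-data gradient from B uniformly drawn examples.
    Cost per call is O(B), independent of the data set size n."""
    idx = rng.integers(0, len(X), size=B)      # draw with replacement -> unbiased estimator
    return np.mean([grad_loss(w, X[i], y[i]) for i in idx], axis=0)

# Toy usage: linear model with squared loss, l(w; x, y) = 0.5 * (w @ x - y)^2.
grad_loss = lambda w, x, y: (w @ x - y) * x
rng = np.random.default_rng(0)
X, y = rng.standard_normal((1000, 3)), rng.standard_normal(1000)
w = np.zeros(3)
for _ in range(200):                           # each step costs O(B), not O(n)
    w -= 0.1 * minibatch_gradient(w, X, y, grad_loss, B=32, rng=rng)
```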
And this is also where that O(1) comes from: this sum doesn't scale with the number of data points; it's just O(B), where B is the batch size. So each iteration is largely independent of n — but then we might need a larger number of them. So, I guess, the compute cost is debatable.

All right, now let's answer the question of how GPs fit into all of this. In the last lecture we already laid the groundwork when we noticed that the posterior mean of a Gaussian process is related to exactly this kind of regularized empirical risk minimization problem. I'm taking a slightly different approach here — not exactly the kernel ridge estimate, because that involves theory you'll go through in statistical machine learning but don't have access to yet — so let's go a somewhat simplified route. Note that the posterior process evaluated at the finite training set X is a Gaussian random variable, because the posterior is a Gaussian process; and its mean is also the *mode* of the posterior distribution over the process values at those points, because for a Gaussian, mean and mode coincide, right? Writing that down: the posterior mean at the training points is the arg max of the log posterior density over whatever values the process takes at those points, which by Bayes' rule is the sum of log likelihood and log prior, plus some constant term that doesn't matter because we're only considering the arg max. Flipping signs, we get a min instead of a max. And if we expand these terms — both of them are Gaussian — we get this, which we can compare to what we saw on the previous slide. First of all, we have a term that expands into a sum over data points: these are the individual per-example losses we incur for every data point. And then we have something that depends only on the "weights" of the model (and some hyperparameters), which you can read as a regularizer — specifically a squared K_XX-inverse norm, an L2 norm weighted by the inverse kernel matrix. If that matrix were the identity, it would just be a standard L2 regularizer, so it's closely related.

There was a question here — everything good? Okay. This one, you mean? Yes: we're assuming additive Gaussian observation noise, independent per measurement with variance σ², so it's a Gaussian with mean f(X) and covariance the identity scaled by σ², exactly. And that's also where this term comes from: it's why the likelihood factorizes over data points and why we get a sum in the log density. Make sense? Everybody on board? All right, nice.

Now notice, first of all, that we have now successfully cast at least one aspect of Bayesian reasoning as an optimization problem, which is the first step we need for comparing the two frameworks: deep learning solves an optimization problem, and a Gaussian process, in a sense, also solves an optimization problem along the way — by different means, but it does.
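Written out — my reconstruction of the slide's objective from the verbal description, dropping the additive constants:

$$
\mu_{\text{post}}(X) \;=\; \arg\min_{f_X \in \mathbb{R}^n}\;\;
\underbrace{\frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - (f_X)_i\bigr)^2}_{\text{sum of per-example losses}}
\;+\;
\underbrace{\tfrac{1}{2}\,(f_X - \mu_X)^\top K_{XX}^{-1}\,(f_X - \mu_X)}_{\text{regularizer: squared } K_{XX}^{-1}\text{-weighted norm}}
$$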
Second thing to note: this is actually just the posterior mean of the unknown function evaluated at the *training* points. But we have conditional independence in our model — if you write it down, it's not too hard to see that given f(X), the data are independent of the function values at the prediction points — and we can use that: given the posterior mean (and, I think, the posterior covariance) at the data points, which you can get in the regular way, you can predict at new, unseen points via essentially this formula here. I'm not sure I actually want to go over it in detail — it's in Rasmussen and Williams, and it's not too important; the point is just that it's possible: if you solve that one optimization problem, you can still access the full posterior, provided you then also compute the relevant kernel matrices. Okay. So that means we can, in a sense, train a Gaussian process by solving a quadratic optimization problem — a least-squares problem with a quadratic regularizer, yeah.

— So the question was what that argument about conditional independence is; did you guys talk about conditional independence in a lecture before? I guess so. The idea: we have some Gaussian process prior f; we evaluate the process at the data points; and by adding independent Gaussian noise to that we derive another random variable y — the observations whose values we see in the end. What we want now is to predict at previously unseen points, f(X*). If we assume f(X) is given, then there's a chain as part of this graph, and because that chain is blocked by knowing this node, we get conditional independence of y from f(X*). I'm not sure whether you've talked about how to reason in these graphical models before; as I said, there's a section about this in Rasmussen and Williams if you want to read up on it a bit more. Don't worry too much if you didn't catch everything — it's only to argue why it suffices to solve the optimization problem at the training data.

— Wait a couple of slides: the question was how you actually get the entire uncertainty, since what we're doing here is just finding the mode — the MAP estimate, exactly. We're going to talk about this in a couple of slides, and it's going to be a complicated story, let's put it like this — not complicated in the sense that the math is complicated; it's just not as easy to get to a satisfying answer, maybe.

Okay. So now there's a bit of a problem, because if you look closely, f(X) — you can think of it as a vector of weights or parameters in R^n — means it's going to cost at least O(n) to solve this optimization problem, because that's the cost of evaluating our model: the number of weights in our model always scales with the number of data points we're looking at. But that's not a problem of being probabilistic — you incur exactly the same problem in kernel ridge regression. It's rather due to the fact that we're working with a non-parametric model — like an infinitely wide neural network, if you want to think of it that way — because we actually have no finite-
dimensional representation of the model. That was the entire point of deriving Gaussian processes: that we can somewhat efficiently work with infinite numbers of features but still get finite computation in the end. And if we take a non-parametric frequentist model like kernel ridge regression with a suitably chosen kernel, we incur just the same problem — so it's not the cost of being Bayesian that makes GP regression necessarily costly; it's the fact that it's non-parametric. Possible ways around this: we can just use a parametric model again, of course — choose a kernel constructed from finitely many features, cast the optimization problem entirely in weight space, and essentially arrive at the same cost as a deep learning model. That's also what people in at least one branch of Bayesian deep learning do, and I think Philipp is going to talk about this for quite a substantial part of the coming lectures. For now, let's stick to the non-parametric model, because it's the more general case — you can always cast a feature-based model this way. There are actually quite a lot of approximation methods in the literature to mitigate exactly this problem — sparse GP regression and inducing-point methods are very well-known approaches, and you can also use structure in the matrices; there are all sorts of things — but for now let's keep it very simple and accept that we have to deal with n parameters. Maybe you can convince yourselves that this goes away as soon as you switch to a finite-dimensional representation of the model.

So, now that we have this loss function, what we typically do is use a first-order, gradient-based optimization method, so we need a gradient of it. Implementing this loss directly won't do: you can see already that just *evaluating* it would take O(n³), because there's an inverse of the kernel matrix in there. But we can employ a bit of a trick. On paper, we can actually write down the solution of this optimization problem: it's just the posterior mean of the Gaussian process, with the placeholder argument — the "bullet" — being, in this case, capital X. And you can see there's a vector α, which in code we call the representer weights — you're going to see in SML in a couple of weeks why — which just wraps up this inverse applied to the mean-adjusted, or centered, data. And now we can do a change of coordinates in the optimization problem: substitute f_X by the expression up there, μ_X + K_XX·α (all at the capital X). If you do that, you arrive at a new optimization problem — just a change of variables: the loss is no longer a function of f_X but a function of α, so we optimize over α. Note that this is a one-to-one correspondence if we assume K to be invertible: for every α there's an f_X and the other way around, so we can do this in a sensible manner. In the form you get at first, the loss is still not evaluable, because you have that inverse in there; but if you multiply out the quadratic, you arrive at a very simple loss function, quadratic in α, which doesn't involve any matrix inverses. That one we can actually evaluate, because the constant part —
So essentially the inverses disappear into a constant part, because that part doesn't depend on alpha: for optimizing this loss function we don't need to care about them, and in the gradient they obviously disappear as well. You can either view this as: we use the original loss, don't have access to the loss value itself, but can access the gradient, because the nasty terms disappear in the gradient. Or you can view it as: we rewrite the loss function, and then none of these terms appear anymore. Questions? Oh sorry, yeah. You mean from the upper one to this one? Okay, do I have time to do this? Maybe let's try it at the end. The idea is: you substitute f_X equals this term for every occurrence of f_X in the loss function of the original optimization problem and then collect a bunch of terms. Just multiply everything out and realize that a sum over independent data points can be written as a quadratic form with a diagonal matrix in the middle. That's the idea, and that's where the sigma-squared-times-identity matrix here comes from.

Yeah, so essentially the idea is: if you knew the solution of this linear system, then you would know the mean, right? And the only unknown here is precisely that solution, which is a vector alpha. We don't know alpha, but we know the solution has this functional form, if you will, once we know alpha, so we can substitute it in here. The idea is to solve the optimization problem partially and then reformulate it, so that the task of the optimization problem is to find the optimal alpha. But computing it directly is costly, so we don't actually want to call an inverse; we want to solve this by gradient descent. We want to treat the GP like a neural network, if you will. Does that make sense? (The question was where the alpha comes from.) More questions about this? You mean this one? Yeah. So f_X is just the value that mu_X essentially takes; it's just a shorthand name for it, some symbol, and we want to optimize over it. If the bullet were capital X, the training data set, then the optimal choice for f_X would be mu_X: the mode of the point-evaluated posterior is the same as the mean of the point-evaluated posterior, because it's a Gaussian. That's the idea.

All right, let me quickly switch to code now. Okay, so we're using essentially the same model we've been using before: a zero-mean Gaussian process with a Matérn 5/2 kernel. And the data set, because we want control over where the data points actually lie, is generated on-model: we draw a function from that prior, add noise to finitely many evaluations of it, and that's our training data set. So the model should ideally be able to recover the data essentially perfectly, because we're using the same generative process for inference that we used to generate it.
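This is not the actual notebook from the lecture, but a minimal NumPy sketch of that generative setup; the function and variable names, the kernel hyperparameters, and the input range are all my own choices:

```python
import numpy as np

def matern52(x1, x2, lengthscale=1.0, outputscale=1.0):
    """Matérn 5/2 kernel, evaluated pairwise between two 1-D input arrays."""
    r = np.abs(x1[:, None] - x2[None, :]) / lengthscale
    return outputscale * (1.0 + np.sqrt(5.0) * r + 5.0 * r**2 / 3.0) * np.exp(-np.sqrt(5.0) * r)

rng = np.random.default_rng(0)
n, sigma = 50, 0.1                       # training set size and noise std (my choices)

X = np.sort(rng.uniform(-3.0, 3.0, n))   # inputs we place ourselves, so we control where they lie
K = matern52(X, X)                       # kernel Gramian K_XX at the training inputs

# Generate the data "on model": one draw from the zero-mean prior plus i.i.d. noise.
# (A small jitter keeps the covariance numerically positive definite for sampling.)
f_true = rng.multivariate_normal(np.zeros(n), K + 1e-9 * np.eye(n))
y = f_true + sigma * rng.normal(size=n)
```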
And so that we have something to compare to, we can compute the posterior in closed form, which is what we want to avoid eventually, but we also need to know what we're targeting, right? That looks like this: a standard Gaussian process posterior, I guess. You can also see the data points, and I'd say it matches the data quite nicely, which is good. But now recall that this is expensive, cubic in the number of data points, and what we wanted to do is solve it by gradient descent. So let's actually do that.

What do we need for gradient descent? Well, a gradient of the objective function, and that's this term: the kernel Gramian evaluated at the training points on both sides plus, and this is actually happening up here, the identity scaled by the variance of the observation noise, applied to alpha, minus the centered data. We also need the prior mean evaluated at the training points; that's just a zero array in this case, but obviously this code generalizes. So the gradient depends on the alpha vector, which we now try to optimize for, and on the data. And having found an alpha, the model, which is to say the prediction at some test point, lowercase x, given the training data at X, the posterior mean, is just a linear map of the learned alpha through the cross-kernel matrix, plus the prior mean; again, that's zero here, so this is fine.

Okay, so let's start by initializing alpha; we need to initialize our parameters, and for now we choose zero. Evaluating the gradient is then just as simple as evaluating that function. We could obviously also have implemented the loss function and used autodiff, but depending on which form of the loss we use, that would have been quite costly, and here the gradient is super simple, so we just implement it by hand. And then we train the Gaussian process by gradient descent: we define a learning rate and a number of epochs. Note that every gradient step, depending on how you define it, is actually an epoch here, because we have n parameters and have to use the entire data set in each step. So technically it's not stochastic gradient descent we're using, but actual full-batch gradient descent; we only call it SGD here because that optimizer is readily available. We initialize and iterate the typical gradient descent equations; it's kind of abstracted away here, and we could also use another optimizer, but what this does is gradient descent without a line search, with a fixed learning rate. (A minimal sketch of this loop follows below.)

And then what we get is this: the dashed red line is the true posterior mean that we're targeting, and the green line is the estimate we get by gradient descent. This is the loss curve we get; it's perfectly smooth, because it's not a stochastic optimizer, obviously. It's maybe not even done training yet, but don't worry about it. So we have, in a very vague sense maybe, trained a Gaussian process by gradient descent now. Questions?
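Again a hedged reconstruction rather than the lecture's actual code: plain full-batch gradient descent on the representer weights, reusing the kernel and data from the snippet above. With a fixed small learning rate, convergence in the flattest directions is slow, which is consistent with the loss curve "maybe not being done training yet":

```python
def grad(alpha):
    """Gradient of the representer-weight loss: (K_XX + sigma^2 I) alpha - (y - mu_X),
    with mu_X = 0 here because the prior mean is zero."""
    return (K + sigma**2 * np.eye(n)) @ alpha - y

alpha = np.zeros(n)        # initialize the parameters at zero
lr, n_epochs = 1e-2, 5000  # fixed learning rate, no line search; every step is an "epoch"
for _ in range(n_epochs):  # full-batch gradient descent (the "S" in SGD is a misnomer here)
    alpha -= lr * grad(alpha)

def predict_mean(x):
    """Posterior-mean prediction at test inputs x: cross-kernel times learned weights."""
    return matern52(x, X) @ alpha

# Sanity check against the closed-form representer weights we wanted to avoid computing:
alpha_exact = np.linalg.solve(K + sigma**2 * np.eye(n), y)
print("max deviation from exact weights:", np.max(np.abs(alpha - alpha_exact)))
```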
So yeah, you mean the training data? The question was whether we're sampling the training data from the same distribution, from the prior actually, yeah. So the data is on-model. You could also use other data; it's just so that we can control where the training data points actually lie. But that's a different question, right? The question was whether it would look like this if we used a different model for generating the data, if I understood correctly. And of course it wouldn't, but that's not a question of whether we train a GP with gradient descent or not; it's a question of model mismatch, and that's something the regular Gaussian process, essentially the thing we saw before, would also have issues with. Because these should be the same thing, right? The posterior mean should be the same as the posterior mode here.

All right, so what's the problem with this? Why don't we just always train GPs by gradient descent? Yeah, that's one argument. If we use a matrix inverse, it costs us quite a bit, but once we've actually computed it, we're done; that is the solution. With an optimizer you approach the solution, but maybe you never actually reach it to arbitrary precision. We'll come back to that in a second. Sorry, I couldn't understand that. Yeah, exactly: we don't have an estimate of the uncertainty; that's essentially what you were saying. Okay, and let's try to go over that in the last five minutes or so. So, obviously we don't have uncertainty. What could we do? Give me some ideas. Yeah, that's an idea. But because the problem is actually convex, you'll probably get very similar results, so maybe that doesn't work too well; although with different learning rates and early termination, it might be an idea. Something you've actually already seen in the lecture, maybe? Exactly, a Laplace approximation might be an idea. You've kind of already seen that, and the Laplace approximation is actually going to be exact in this case, because we're approximating a Gaussian with a Gaussian. And we can just do that; the math is literally what you've already seen. You take this loss function that we've been working with, you take the gradient, then the Jacobian of the gradient, which is the Hessian, and you can see that it's this. That means the posterior covariance is going to be the inverse of that, or of its negative, rather, and that's literally what you would expect. It's maybe a bit difficult to see: this is the precision form of the posterior covariance, so to actually get the covariance you need to apply the matrix inversion lemma once; you've already had that on a previous slide. But then we still need to compute an inverse, right? That's maybe not super satisfactory.

So another approach, and I'm going to go over this very quickly and ask Philip to go over it again next week, maybe, is to work with something you could call the sensitivity of the minimizer with respect to the data. We look at the minimizer of that optimization problem and notice that we can write it down in closed form; not computationally, maybe, because that's expensive, but analytically it's just the solution of this linear system. And then note that the derivative, the Jacobian of this expression with respect to y, is just that matrix.
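Again a hedged reconstruction of what I believe is on those slides, in my own notation:

```latex
% Hessian of the loss over f_X (constant, since the loss is quadratic):
\nabla^2 L(f_X) = \sigma^{-2} I + K^{-1}
% Laplace approximation: covariance = inverse Hessian of the loss; this is the
% "precision form". One application of the matrix inversion lemma gives
(\sigma^{-2} I + K^{-1})^{-1} = K - K\,(K + \sigma^2 I)^{-1}\,K,
% which is exactly the GP posterior covariance at the training inputs -- so the
% Laplace approximation is indeed exact: a Gaussian approximated by a Gaussian.
% Sensitivity view: the minimizer and its Jacobian with respect to the targets y,
\alpha^\star = (K + \sigma^2 I)^{-1}(y - \mu_X), \qquad
\frac{\partial \alpha^\star}{\partial y} = (K + \sigma^2 I)^{-1},
% i.e. precisely the (regularized) inverse Gramian that appears in the posterior.
```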
So if we had a means of differentiating the solution of an optimization problem with respect to the training data, which it obviously depends on, then we could just use that matrix, which in this case is the inverse regularized Gramian, and substitute it into all computations that use it. Any ideas how to differentiate the solution of an optimization problem with respect to its parameters? There are two ways, well, one obvious way and one a bit more involved. You mean a finite-difference approximation? Yeah, that's an option; I actually didn't think of that, good point. I haven't thought it through, but if you want to try it, try it. Interesting. So I guess the obvious thing is: gradient descent is just a chain of computations, the unrolled course of gradient descent, doing successive updates, and that chain is differentiable. So you could theoretically differentiate through the gradient descent steps of your optimizer, because they span a computation graph. It's going to be super expensive if you do many steps, but it is a way to get at this. Maybe the fact that it's expensive and actually also quite unstable tells you it's not the best idea to do that. The other thing is something called implicit differentiation; you can read up on it, it's not super important, never mind if you're not interested. Point being: if you use implicit differentiation, you also end up inverting an n-by-n matrix, so nothing gained there. These are, I would say, the two standard approaches to differentiating minimizers, by the way.

But actually, instead of differentiating through the whole unrolled graph, we can approximate by differentiating only with respect to the last step the optimizer took. (I realize I'm a bit short on time, so let's make this the last thing we talk about.) The last step: assume we make N steps, so in the N-th iteration we arrive at some estimate alpha_N, and with gradient descent that's the (N−1)-st estimate minus the learning rate (eta is often used there) times the gradient of the loss function at the previous estimate. Truncating after one iteration means we assume, and this is obviously an approximation, that alpha_{N−1} does not depend on the training data, while we know that the gradient does. So we just differentiate this one expression with respect to the y that enters the gradient computation. It's an approximation, of course, so it's of questionable validity; but if we do it, using a slightly modified version of the formula for stability compared to the one on the slides, we end up with this. It looks like it approximates the uncertainty reasonably nicely, actually. But if you look at the sensitivity, this Jacobian, it's just a diagonal matrix, which comes from the specific form of the update, and maybe that already tells you this approximation is actually kind of bad. But it's a place to start, and we're going to go over similar things in a bit.
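Spelling out that truncation under my reading of the slide (hedged; the lecture's implementation uses a modified, more stable variant, so the details may differ):

```latex
% The last gradient descent step, with learning rate \eta:
\alpha_N = \alpha_{N-1} - \eta\,\nabla L(\alpha_{N-1}),
\qquad \nabla L(\alpha) = (K + \sigma^2 I)\,\alpha - (y - \mu_X).
% Truncation: treat \alpha_{N-1} as a constant with respect to y, so only the
% final gradient carries y-dependence, and
\frac{\partial \alpha_N}{\partial y} \;\approx\;
  -\eta\,\frac{\partial}{\partial y}\,\nabla L(\alpha_{N-1}) \;=\; \eta\,I,
% a scaled identity -- diagonal, as stated -- in place of the true sensitivity
% (K + \sigma^2 I)^{-1}, which is why this cheap approximation is rather crude.
```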
Let me just summarize real quickly; questions we can talk about after the lecture. So, exactly: we still get at least cubic time for this, and "at least" because gradient descent might actually take loads of steps, since in both cases we end up doing something at least as costly as inverting the system, or actually inverting the system. So the story isn't quite told here, and we're going to elaborate on better ideas to approximate this in further lectures.

All right. We've seen today that instantiating a Gaussian process model doesn't cost you anything, because it's a lazy functional object that doesn't yet need access to any data. We can train a Gaussian process, just as we can train a neural network or any other model cast as risk minimization, by gradient descent (whether the "S" in SGD applies here is still debatable). We can make point predictions, so we can do inference at test time, in O(n). But we haven't really figured out how to get sensible posterior covariance estimates yet, so we don't quite know how to combine this gradient descent trick with sensible posteriors, and that's going to be the topic of coming lectures, or the coming lecture rather. With that, I'm pretty much done. Thanks, guys, and see you in the next lecture.