I'm not going to start with detailed feedback, because the last lecture was given by Marvin Pförtner and it doesn't make sense for me to go through his feedback. Instead I'll use the time to pick up three selected questions from that feedback which aren't really about his lecture, but more about the overall direction the course is taking, because I think it's important to help you understand why we're doing certain things in this course. So let me go through the individual questions. Someone asked: I'm missing the high-level intuition. Why do we even use Gaussian processes? Am I not bringing too much information into the process by choosing my kernel myself? This is a very good question, and I think it's also a snapshot of the state of the field if you feel that it's a bad thing to bring in information. It's the exact opposite: if you know something about the problem, that's good. Why would you not want to bring information into a problem that you have? There's a narrative, coming from statistics I think, that you shouldn't inject information into your algorithm because it creates a bias. And there are some applications, for example in science, where this is actually true: you don't want to accidentally convince yourself of something that's patently wrong by injecting your prior beliefs. But if your prior beliefs are actually correct, then they help make the algorithm work. And we've seen in numerous examples so far that no algorithm will ever work without any form of prior assumptions. You have to make prior assumptions. And some of the kernels that we've used in the examples amount to relatively vague prior assumptions, like assuming that the function is twice continuously differentiable. No other model that we could construct in a machine learning lecture, or anywhere else, is free of assumptions. A deep neural network is also not without assumptions; it's just that the model is so complicated that we've stopped thinking about the assumptions that are in it. If I give you some untrained transformer with five layers, it's very difficult to think about what the assumptions are, but they are still there. The advantage of these Gaussian process models is that we can look at these things: we can draw samples, we can make little plots of marginals on low-dimensional subspaces, and we can ask the model how it behaves. That is an answer to the first question. But of course there's something else hidden in this question, which is: why do we have to build these very structured models? I've heard there's this other thing called deep learning, why don't we just use that? And maybe the second question goes in this direction: I may be wrong about this one, but it looks like we'll only cover GPs. Do we have a plan to cover extra topics like variational inference, Markov chain sampling and so on? Given that this is a probabilistic machine learning class and it's compulsory, I believe it's better to cover a wide variety of topics that are still relevant than to go in depth about just one topic. So, this is university teaching at a master's level. We're not in math, or economics, or law; law is different, because there things are dictated from above. Machine learning is moving at an absolutely rapid pace, I'm sure you'll all agree, and there is no authority in the world for what is relevant.
There is no Staatsexamen, right? There is no one who sets a list saying these are the things everyone needs to know. Everywhere across the world, people who teach classes like this have to make decisions about what they teach. Now, there are some textbooks, but first of all the people who wrote these books aren't necessarily the unique authority on everything. I know some of them personally, so I know that they are not. The other thing is that most of these books are by now 10 or 15 years old, and when they were written, the field looked completely different. When I first taught this class, well, a variant of it, in 2013 or so, it looked completely different. We spent a lot of time on graphical models, factor graphs, sum-product algorithms, message passing, approximate inference with expectation propagation and variational inference and so on. And every time I teach it, which is roughly every other year, it's basically a new course, because the field has changed. Towards the end, and I'm going to tell you in a moment what we are going to do next in this course, there will hopefully be some time for an outlook into other kinds of probabilistic machine learning. But this term I've taken a very conscious decision to focus a lot on supervised machine learning, not just GPs; we'll also talk about deep neural networks, because they currently dominate the public discussion of machine learning and also probably dominate what future employers might ask you to do. And I do this because I think there are some things to understand that are not yet widely understood, maybe even outside of Tübingen, that I find quite valuable for helping you understand how deep learning works. Hopefully we'll get there. The downside is that a lot of this is a bit seat-of-the-pants. You'll notice that the lectures aren't perfectly aligned with each other, we sometimes run out of time, the order doesn't quite work. That's because I basically make the lectures one week in advance, and that's just the price we pay. But I still hope that in hindsight you'll think, okay, I actually learned something here that is quite useful. Because once you go and work anywhere, whether it's in science or in industry, your employers are going to ask you: so, what have you learned? Or maybe they won't ask, but they'll find out once you work there. And I'd like them to think, oh, Tübingen people learn interesting stuff, and not, ah, this stuff is all outdated. There are also universities out there which have maybe one AI research group, or none, just a research group that does something vaguely related which then has to teach this class, and there the class would look very different. But here we have this entire rich tapestry of lectures available, which also allows me to focus a bit on one thing. I know that colleagues in great places across the world do very similar things, not with similar content, but they also decide to focus on one particular thing. So I'll tell you more about what we want to do in a moment. The other question was: why do we need to learn so much about Cholesky and linear algebra? We're going to use a library function anyway. Well, today's lecture, weirdly, will be entirely about Cholesky again, and I'm going to tell you in a moment again why, but this is the first part of the explanation.
It's true, actually: if you do GP regression in the textbook case, as we'll do next week as well in an exercise, and then we're done with GPs, then yes, Cholesky is just something you call. And if you've been in a numerical linear algebra class before, people have told you: don't ever implement this yourself, because it's so heavily optimized. There are these Fortran codes lying around that are optimized for every architecture; that's true. So today I'm not going to show you code, at least not Python code; I'll show you pseudocode. Precisely because I don't actually want you to write these algorithms yourself in Python, but I do want you to understand how they work. And with that, I can maybe get to the next slide, and then things should become a little clearer, hopefully. So the point of this entire course, for me, if I had to put it into one sentence, is to develop a probabilistic perspective on contemporary machine learning, okay, that's actually two sentences, one in which the process of learning is phrased as inference, which means manipulating probability distributions over a space of hypotheses. This is what I would like you to think about. All of machine learning consists of algorithms that take in data and use it to refine a model, and that process can be thought of as computing posterior distributions. Pretty much all of machine learning can at least be understood from this perspective, and a lot of it can be very directly derived from it. That's what I'd like to achieve. And because we only have one semester, we have to focus on some parts. Two years ago there would have been a section on mixture models, on expectation maximization, on variational inference and so on. Towards the end of this class, maybe we'll have a little time to point in a few of these directions, but these are not such prominent tasks anymore in contemporary machine learning. So we focus a lot on the supervised setting of pairs of inputs and outputs on which we learn some form of map, some function in general. So far, we've focused on the textbook base case of this, which is learning functions that map from some input space X to the real line, or actually to multiple copies of the real line, a vector space. This is called regression. It's a special case, but it's very interesting, and that's why we spend so much time on it: it provides a very clear answer in the form of Gaussian process regression, with this numerical linear algebra layer below. We'll talk about that lowest layer on the computer, the numerical linear algebra, today one final time, well, one final time in the GP world, to really understand how these algorithms work and why they work the way they do, because they explain why it's possible to learn in finite time, with full precision, efficiently. Then from Monday onwards we turn to problems where there is no closed-form solution anymore from the probabilistic perspective. That will first be classification; I will talk about classification and try to solve it with Gaussian processes. This won't quite work, so we'll have to start approximating things. That will be next week. And the week after, we replace the Gaussian process with a deep neural network. And that replacement will actually be surprisingly simple.
It will basically be one lecture, maybe two, and then we'll suddenly have all of supervised deep learning at our disposal. And when we get there, we will notice what many of you already know because you've taken a deep learning class, which is that deep learning is sort of perceived as being cheap somehow. Which is a bit weird, because it's simultaneously also one of the most computationally expensive and demanding things that humanity is currently doing. It might actually be that deep learning, as of this year, is the most energy-intensive thing that computers do across the world, cumulatively. It used to be climate simulations, depending on how you cut up the space of tasks; now it's probably deep learning. Or maybe in between it was crypto at some point, I don't know. So there's this weird thing hanging in the air: there's this technology called deep learning which, if you look at it from far away, is just super expensive. It's this very, very difficult task that involves basically supercomputers, although these days we call them data centers, with massively parallel hardware that runs for months on end to be trained, and training one of these big models costs hundreds of millions of dollars, literally. And yet in an academic setting, when we compare different types of machine learning algorithms to Gaussian processes, we think of Gaussian processes as the expensive thing, because they are cubically expensive. But the algorithms you've encountered so far seem very simple, very fast; you can run them on your laptop in a few seconds. Why? Because our data set is small. We've looked at toy data sets with 20, 100, a few hundred data points. And we realized that inverting a matrix, which is the underlying mathematical task we have to solve, is in principle cubically expensive in the size of the data set. If you have a hundred data points, then a hundred cubed is fine: that's (10^2)^3 = 10^6, a million operations, something your machine does in the blink of an eye. But if you have billions of data points, that won't scale. And when you ask people why kernel methods, and GPs are basically a kernel method, right, we realized that we're doing a computation that involves the kernel, why these kernel methods are not so popular anymore, then one of the typical answers you might hear from experts in the field is: eh, it's because they are expensive; they involve inverting these matrices, and that's too expensive. So what I want to achieve today is to get a much clearer picture of where this n³ arises. And the main thing I want to get across between the lines, it's sort of an abstract insight, is that this comparison is not fair. So, when you think of deep learning as something that is O(1), and I'm writing one because, okay, hands up, who's been in the deep learning class? That's the majority of you, but not everyone. Who has not, if you dare raise your hand? Okay, good. It doesn't matter, you can still stay here; we'll talk about it the Monday after next. But basically, in deep learning there is a big sum in the training process that runs over the entire data set, and that sum gets approximated by subsampling the data into so-called mini-batches, or batches.
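To make that O(1) feeling concrete, here is a minimal sketch of a mini-batch gradient step; the names (params, loss_grad, batch_size) are placeholders rather than anything from a particular library, and this is an illustration, not the training loop used in the course:

```python
import numpy as np

# Illustrative sketch only: the per-step cost of mini-batch SGD.
def sgd_step(params, X, y, loss_grad, lr=1e-2, batch_size=32):
    n = X.shape[0]
    idx = np.random.choice(n, size=batch_size, replace=False)  # subsample the big sum over the data
    g = loss_grad(params, X[idx], y[idx])   # gradient evaluated on the batch only
    return params - lr * g                  # cost of one step: O(batch_size), independent of n
```

The point is that one such step never looks at more than batch_size of the n data points.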
And that means that each step of the algorithm, whatever it is, SGD or that weird zoo of methods, Adam and all its variants, doesn't have to touch the entire data set every time it takes a step. So we could say they are O(1). It's a bit of a cheat, but it's the feeling people have about these methods. Whereas the algorithms we've talked about so far, Cholesky decompositions, are very rigorously mathematically analyzed, and this statement here means: this algorithm is done after n³ steps, or O(n³) steps, and then there's nothing more to compute. While this other thing is one step of an algorithm, and you have to take many of these steps, we just don't know how many. We actually don't know how many, to be honest, because these methods don't converge in finite time, only asymptotically. But the nice thing about them is that you can stop them at any time and get out something that has done something, that has changed the model, and that's called training. And this word, training, conveys this process of sitting in front of the machine and watching and turning knobs and waiting for it to converge. Those of you who have trained a deep network know exactly how this works. In GPs, the training process is just this linear algebra bit. That's it, it's just calling this linear algebra library. So this comparison is a comparison between a very loose lower bound, one step, and an upper bound for what you could possibly have to do to be finished. What I want to do today, by looking very deeply into Cholesky decompositions, is to get a much clearer picture of this spectrum between O(1) and O(n³). Because what these linear algebra methods actually do when you look inside, and this is what Marvin already started on Monday, is a process that can be understood from the perspective of machine learning. It is something the numerical mathematicians did for us 50 years ago, but we can now look at it from the contemporary perspective of our field and understand it in our terms. And spoiler alert: it'll turn out to be a data loading process with some bookkeeping. If you've trained a non-trivial deep neural network, you know that you usually have something called a data loader. For very large problems, the data loader is actually a piece of software that takes care of I/O, getting stuff from hard drives that sit somewhere, feeding it into these big machines, and sharding it out. And maybe after today's lecture you'll come away thinking that this data loader for big deep neural networks is something people should look into a little more, because it's not yet what it should be. So let's go back to Cholesky decompositions. This is one of the last slides that Marvin showed you; I've basically just copied it over. Here is this algorithm, and today we're only going to look at pseudocode, which looks like this. Here is how I want you to think about this: there is this algorithm that sits on a low-level numerical routine provided in Fortran, written in 1978, and then there's even a special version of it that is optimized for your chipset.
So depending on what kind of laptop you have in front of you, there are different libraries underneath that farm this out to the different architectures on your chip, and it computes these matrix decompositions. Today we do not want to understand exactly how this works on the chip, so we're not going to implement it ourselves. What we want to understand is the structure of these methods. I am going to show you somewhat tedious, complicated equations, and sometimes I'm going to wave my hands a bit and say, ah, this is not so important, I just want the high-level picture. That's the mindset we want to be in: we don't need to be able to implement this ourselves, but we want to understand what it actually does. So there's this method that clearly contains a for loop, and in that for loop it repeatedly computes something involving a row of the matrix, and a column, because it's a symmetric positive definite matrix, so if you get one row, you also get one column for free, sort of. Then it does something with that row, that's line five, which you can use to update this thing called the decomposition, this lower triangular matrix L. And since you've used the code and looked at it in our gaussian.py library, you know how we use this thing: when we train, we call it once to get this lower triangular decomposition of the matrix. And then it turns out that this lower triangular decomposition is a form of data structure that allows us to do everything we want to do in probabilistic inference afterwards. That's actually something to reflect on for a moment; it's a really cool thing. There's this one data structure that we can construct, which is quite an abstract object, it turns out to be a lower triangular matrix, sort of, and it's exactly the thing we need for everything. It's what we need to compute the posterior mean of the Gaussian process, the point estimate for a function that we can evaluate anywhere; to compute the posterior covariance, a joint error bar over all the function values, which we can compute everywhere; and in fact also to compute the evidence, the thing we might use to optimize the model, because that involves an expression that looks a lot like computing the posterior mean or covariance, and then a log determinant. I don't think we've had a slide on this yet, maybe I can show it at some point, but it turns out you can also compute the log determinant from a lower triangular matrix very efficiently. You all roughly know from your first year as undergraduates how determinants are computed: it involves something with all possible combinations of rows and columns, and for a triangular matrix only the diagonal survives. So you can go through the diagonal and you have the determinant; it's pretty much the product of the elements on the diagonal. That's kind of neat, right? We can do all of this with this one data structure. So what this thing clearly does is some form of bookkeeping. It rearranges the elements of the matrix, or does some nonlinear computations on them; by nonlinear I mean it divides by stuff, there's a one over square root of something, that's not a linear computation. And in doing that it produces something that we can then work with very efficiently. And Marvin spoke about this very efficient thing.
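Just to ground the three claims above (posterior mean, posterior variance, and the log determinant from the diagonal), here is a minimal sketch using standard NumPy/SciPy routines on a made-up toy problem; the data, kernel and test point are purely illustrative, and the triangular solves it calls are exactly the "very efficient thing" discussed next:

```python
import numpy as np
from scipy.linalg import solve_triangular

# Made-up toy problem: RBF kernel Gram matrix plus noise, targets, one test point x* = 0.3.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y_bar = np.sin(X[:, 0])                                    # mean-corrected targets
K = np.exp(-0.5 * (X - X.T) ** 2) + 1e-2 * np.eye(50)      # k(X, X) + sigma^2 I
k_star = np.exp(-0.5 * (0.3 - X[:, 0]) ** 2)               # k(x*, X)

L = np.linalg.cholesky(K)                                  # the one factorization we keep

# Posterior mean: alpha = K^{-1} y_bar via two triangular (back-substitution) solves.
alpha = solve_triangular(L.T, solve_triangular(L, y_bar, lower=True), lower=False)
post_mean = k_star @ alpha

# Posterior variance at x*: k(x*, x*) - k(x*, X) K^{-1} k(X, x*), via v = L^{-1} k_star.
v = solve_triangular(L, k_star, lower=True)
post_var = 1.0 - v @ v

# The evidence needs log|K|; with a triangular factor that is just the diagonal.
log_det_K = 2.0 * np.sum(np.log(np.diag(L)))
```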
So the very efficient thing is that once you have this triangular decomposition, you can solve any linear problem on this matrix: if you want to multiply with the inverse of the matrix, that multiplication with the inverse consists of this back-substitution process. I'll draw a picture: you have a triangular matrix that looks like this, L transpose, an unknown vector x, and a known right-hand side b, and you want the vector x such that L transpose times x equals b. So this is the A times x equals b sort of problem; well, to solve A x = b we actually need both triangular solves, but we can do one after the other. You can solve this recursively by starting at the bottom: this last element is just this divided by that; now we know what it is, we plug it in, plug in the value we need for the next row, divide, and you only ever have to look at one more entry, and so on. Solving this problem is then quadratically expensive in the size of the matrix, and quadratically expensive is as expensive as multiplying with a matrix. Because it's so efficient, you basically never have to worry about explicitly writing down the inverse, because you can just run this process every time it's needed, and that's so close in cost to multiplying with a matrix that people stop caring about the difference. But for our purposes, to understand what the method does, we could say: well, maybe I would like to know the inverse of the matrix after all. Somehow the computation of the inverse is prepared in this process; can we also make it explicit? Can we talk about what actually happens here? If we would compute the inverse afterwards, could we also have done that during the algorithm, while the Cholesky decomposition runs? And yes, we can. So let's try to do that, for the sake of understanding where it would come from. Say I wanted an approximation to the inverse of the matrix which gets constructed iteratively as the algorithm runs. By the way, one reason you might want that is so that you can stop the algorithm midway; that would be nice, because it would get us a little closer to the deep learning setting where you can just stop whenever you want. So if I wanted something like this, we'll just take it at face value: there's this thing that is approximately the inverse, so I'll treat it as if it were acting like an inverse. I take the Cholesky decomposition at step i, while the algorithm runs, this lower triangular construction, and, well, let's multiply with the actual inverse from the left and the right, and get something that I'll call C. So here's the construction: if I take L_i L_i transpose and multiply from the left and the right with A inverse, then on the left-hand side I get one A inverse from the left and one from the right, which I can pull into the transpose. And on the right-hand side, I've multiplied A with A inverse from left and right, so one of the A inverses cancels with the A and I'm left with an A inverse. So this gives me an object that I could think of as an approximation to the inverse.
Because on the right-hand side there is the actual inverse, and on the left-hand side there's an approximation to it, in the same sense that L L transpose is an approximation to A. And after n steps, this approximation is exact; that's a reason to treat it as an approximation along the way. Now, having constructed this object, we can look at how it gets constructed, or whether we can find an expression in the algorithm that we can use to build it directly. So we look at this last line of the update for L, where, by the way, on the left-hand side this is the exact same algorithm as before; I've just added a bunch of empty rows where we can copy stuff in, so there's nothing new yet. Now we look at this last row of L, and here's one of those bits where you don't have to follow the math exactly; you just look at the expression. There is an l_ii here, there's an L_i that we already know from the previous iteration, so it will probably allow us to reuse something, and there's a new entry, little l_i, I don't even know whether that's an i or an l, I have to check, and it is given by this line seven. So we plug it in. In that line seven we have this e_i over the norm of e_i; that's the row that is being loaded. We plug in the expression for A prime, which comes from line nine, and we find this L L transpose expression, which we can now express in terms of C_{i-1} from the line above. So this gives us a rule for which term to add to our approximation of the inverse, a running update of an estimate of the matrix inverse. And now let's think again at the high level: this is a process that, while the algorithm operates on the data set, so it loads columns of this matrix, here called A, which in general is a symmetric positive definite matrix, but for Gaussian process regression is this k(X, X) plus sigma squared times identity object, keeps updating an estimate of what the inverse of that matrix is, which is exactly what we need to compute our posterior mean and posterior covariances. So we can think of the Cholesky decomposition, as it runs, as also computing an estimate for the inverse of the matrix, and we call that estimate C. And you can update the pseudocode for the algorithm, which we wouldn't actually implement ourselves in Fortran, but we could write it this way, with an additional hook that returns this inverse of the matrix. And this is now an algorithm that we could stop at any time, and if we did that, it would give us an approximate inverse of the matrix, which we can use to compute a posterior mean. Just a moment, let me finish that thought. There is this algorithm that so far we thought of as a black box. Now we've peered into the black box a little and we saw that it contains a for loop. This is kind of nice, because we're worried that we have to pay n³ operations while we're competing with a different type of machine learning that is at best O(n), maybe even O(1). So we realize now that we can get away from n³ to n² by stopping that for loop early, after five iterations, or ten, or a hundred, whatever; as long as it's less than n and doesn't depend on n, we can hide it in an O(n²) expression.
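Written out, the construction from the last few sentences is (with $L_i$ denoting the partial Cholesky factor after $i$ steps, and using that $A$ is symmetric):

$$
L_i L_i^\top \approx A
\quad\Longrightarrow\quad
C_i \;:=\; A^{-1}\, L_i L_i^\top\, A^{-1} \;\approx\; A^{-1} A\, A^{-1} \;=\; A^{-1},
$$

with equality after the full $n$ steps, when $L_n L_n^\top = A$ exactly.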
And at the same time, what we've done is fix up the part of the algorithm so that it produces the thing we actually need, this matrix inverse, which we can now replace by this C_i constructed by the algorithm. There were two questions. What this means? We'll get to that in a moment. And whether there should be a transpose: the question is whether there's a transpose or not. There might be, but this is actually one of those places where it doesn't matter so much whether we get it exactly right; maybe there's a transpose here. The important thing is that we can compute this object, because it involves this thing d_i, which I've written up here. Let's look at that and think about it; actually no, there's no transpose, I'm pretty sure there's no transpose, this is correct this way around. Why? Because this object, which we need to construct C_i, involves s_i, where s_i is our new name for the vector that loads row i of the matrix. So d_i is the identity times s_i, which is just s_i, minus the old inverse estimate times A times s_i. What we need to do here is multiply the matrix A with s_i, and since s_i is e_i, this means loading one row of A. Then we take that row and multiply the old approximation to the inverse with it from the left, and if that old approximation to the inverse has been constructed iteratively in the previous steps, it's probably fast to multiply with, because it's low rank; we'll see that in a moment. A times s_i is going to be the dominating cost: it's loading this one row. And then we can continue with the construction and see that everything else that needs to be done can be rewritten as constructing this inner product, which is an operation that is O(n), linear in time. Why? Because we've already computed A times s_i, so, ah, now I know why there are no transposes: because A is symmetric positive definite, so it doesn't matter whether there are transposes or not. So this thing we've already computed; we don't have to compute it again, it's available. This here is just a vector, so this is just an inner product of two vectors, linearly expensive. And that turns out to be exactly this thing, because s_i is e_i. Now we get an update for l_ii; that was already in the previous version of the algorithm. Then we can update the inverse of the matrix by what? By dividing by a scalar, which is essentially free on a computer, and computing an outer product of rank-one objects, of vectors. So this object is low rank; it's a low-rank matrix that we're dragging around. Okay? Yeah, that's a bad notation; in a moment I'll get rid of the l_ii anyway, and then we don't have to deal with it anymore. Well, it's constructed by this algorithm, so it starts with an empty matrix and then it just does that. Yes, if you actually keep it like this as a full matrix, then yes. But of course you could think about storing it more efficiently if you want to do iterative updates. So if one actually wanted to implement this algorithm, I would think carefully about how to represent these objects. For example, in this representation we notice that we can think of the inverse approximation as starting with an empty matrix and then adding rank-one terms. So maybe the best thing to do, if you're planning on running this algorithm for only a few steps, is not to build this quadratically big matrix, but instead just to store these rank-one objects.
Because then, when you need to multiply with it, when you need to multiply from the right-hand side with b, which in our case is y, we can do this product C_i times y very efficiently, in O(i times n). Why? Because if C_i is a bunch of things that look like this, where this is i and this is n, then if I multiply with something of size n, a vector, each of these multiplications is O(n) and I have i of them, so it's O(i times n). And afterwards I have to put this back together in here, but that's the same kind of operation, so it's again i times n; or maybe i² times n, there might be a square on the i, but definitely linear in n. So that's cool, because I can directly compute the objects we care about. This actually means we could rebuild, if it were 1970 and we had the foresight to realize that at some point people would want to build Gaussian process algorithms, we could write LINPACK, the precursor to all the contemporary linear algebra packages that everyone uses under the hood, to do this directly, to compute inverses of matrices. But actually we could do even one better. If you think back to how we implemented Gaussian processes in gaussian.py, you'll remember that we actually store two things: not just the Cholesky decomposition, but also these representer weights, which are these things, C_i times (y minus mu_X); sometimes we call them alpha on some slides. It's a vector that gets multiplied with the kernel function to produce the posterior mean. These are weights for the data points, basically the data points linearly mapped through the inverse of the matrix, which then very efficiently allow us to evaluate the posterior mean. Maybe we could call training a Gaussian process, in analogy to training a deep neural network, computing these two objects: alpha, which is not on this slide yet, but it's C_i times (y minus mu_X), the final bit of this equation, and this matrix C_i. So let's see if we can do that inside our lower-level routine. And that bit is relatively straightforward, because if you already have the inverse of the matrix, called C_i, then you can just multiply with (y minus mu). And, oh, I've lost focus there. So you could add one more line to the pseudocode on the previous slide and use that to compute alpha. Alpha is C_i times, well, okay, I'm going to just write y; of course there might be a subtraction of a prior mean, but that's trivial, you can do it before you even start processing. The important bit is that we multiply by this approximation to the inverse of the matrix. And now we can look at what this actually does in that one row of our for loop. We plug in the corresponding line, which is this old inverse plus an update, and notice that we've already computed this old inverse times y in a previous step: that's the previous estimate of alpha. And the update is, well, let's plug in what d_i is from the line above, up there, line five. We notice that this is actually the thing we need to compute, and it involves the old estimate for the solution, plus one over some scalar, a step size if you like, times a projection of the data onto something involving the i-th unit vector. And this is the equation for what's supposed to be computed.
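As a sketch of that storage choice: one could keep C_i as a list of rank-one pieces rather than as a dense array; the class and method names here (LowRankInverse, matvec) are made up purely for illustration, not the interface of any actual library:

```python
import numpy as np

class LowRankInverse:
    """Didactic sketch: keep C_i as the list of rank-one pieces (d_j, eta_j)
    instead of a dense n x n array; applying it to a vector costs O(i * n)."""
    def __init__(self):
        self.ds, self.etas = [], []

    def update(self, d, eta):
        # one step of the loop contributes one rank-one term d d^T / eta
        self.ds.append(d)
        self.etas.append(eta)

    def matvec(self, v):
        # C_i @ v = sum_j d_j * (d_j @ v) / eta_j
        out = np.zeros_like(v, dtype=float)
        for d, eta in zip(self.ds, self.etas):
            out += d * (d @ v) / eta
        return out
```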
Now you can go back to the pseudocode on the slide and stare at it for a while, and convince yourself that a lot of the quantities in there have already been computed in other lines of this algorithm. They're available, and we can plug them in. It turns out that updating alpha_i actually involves only quantities we've already computed, so we get it for free. So we could use this algorithm not as a generic linear algebra routine called Cholesky, but as a specific algorithm for Gaussian process regression, by adding these red and blue lines. Red is a bit difficult to see, but it's lines four, five, six and eight, and the blue line is line nine. This enriches Cholesky so that it does two things: first, it computes an approximation to the inverse and to the solution of the linear problem at iteration i; and secondly, because it does that, we might allow ourselves to stop the algorithm early. So this has suddenly become an iterative method, something we can just run on a data set for a bit and then stop before we've gone through the whole data set. And that gets us closer to how deep learning works, right? So far, Cholesky took one epoch: it went through the whole data set once, you had to give it the whole thing, call it, and that was it. But actually we can think of it as going through the data set one row of the matrix after the other and computing an updated estimate of the posterior mean and the posterior covariance. And now, before the break, I want to briefly tell you how exactly that update looks, where it actually comes from. Do I do this now or after the break? No, I'll do it now. So this was the numerics layer. Now let's think about the model layer, or, what did Marvin call it, the application layer? The bit where, from your perspective as the user, something actually happens when you condition on a data set. For a moment, let's put Cholesky aside and think of another setting. Having realized that Cholesky goes through the matrix one column after the other, or one row after the other, it doesn't matter which because A is symmetric, we might suspect that this has something to do with processing the data set one datum at a time: one element of y after the other, one data point after the other. What would that look like if we had to do it without access to Cholesky? Say we are doing Gaussian process regression: we have a Gaussian likelihood and a Gaussian process prior on all of f, which in particular is also a Gaussian prior on the values of f at the training locations, f_X. We'll simplify the notation: I'll use a capital K for this Gram matrix and y bar for the mean-corrected data, to clean things up a bit. Now say we started at some point with an empty data set and then iteratively added one datum after the other, for example because someone handed the data points to us one at a time, or maybe just because that's what we like to do. Then at time i minus one we have two objects, a posterior covariance and a posterior mean. Let's look at the posterior mean. It looks like this, where I have used Python's, or NumPy's, slicing notation.
What I mean by this is: the posterior mean at time i minus one, at the locations capital X, is the prior mean, boring, plus the kernel evaluated between all the training points on one side and the first up-to-i training points on the other, then the inverse of the kernel Gram matrix on the first i training points, times this mean-corrected data vector up to the i-th data point. So this is all the data up to i minus one, plus the new data point at i. This has structure: if you look at it as a matrix, it involves the old data and the one new datum, both in the kernel map, in the matrix, and in the data vector. So here is a long vector that contains all the data up to i minus one and then one more; this is a scalar; and this is a block matrix, which looks a bit like this: there's a block up to i minus one by up to i minus one, then a single vector that shows up here and, transposed, there, and a scalar up here. And what we need is the inverse of this matrix. Can you follow? No? Okay. So: if I give you i minus one training points, you do your Cholesky, whatever. And now someone gives you one more data point and says, add that to your trained model. For a moment, think about deep learning: what would you do? I don't know, somehow do more SGD and hope that it works? I don't know what would actually happen. Could be bad. We'll get back to that. For Gaussian processes we can think very carefully about what we should do with data; it's information being processed, that's the core of probabilistic inference. We already have this stuff, now we get one more datum, and what we need to do to infer the function given the entire data set is to take this matrix and invert it. And inverting this matrix, for Gaussians, is the entire inference. What we need to do is take care of how much the new data point co-varies with the old ones and how much it varies itself, and then set those into relation. Keep in mind that kernels are covariance functions, so this matrix contains the covariance of the old data, the covariance of the old data with the new one, the variance of the new data point, and the transpose of that cross-covariance. That's the slim block, that's the scalar, and that's the big matrix. So linear algebra is going to take care of the bookkeeping for us: of how these things relate to each other, how the new data relates to the old, how much it tells us about the function given that we already have the old data. Literally "given that", because that's what a conditional distribution is. So we could think about how to compute this matrix inverse, but actually we don't have to, because Issai Schur already did it for us, in the, I don't know, 1940s or so, no, the 30s probably, because he was dead by 1940. Actually much earlier; he did this already during his PhD. And he provided this equation, which you can find in various textbooks in this form. It says: if you have a matrix A that can be separated into blocks, where each of P, Q, R and S is a matrix, a rectangular matrix in general, then you can compute the inverse of this matrix, assuming it exists, as this object, assuming that all the inverses in that representation exist. If all the inverses on the right-hand side exist, then the inverse on the left-hand side exists, and vice versa. Okay, a complicated expression.
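For reference, since the slide isn't reproduced here, the identity being pointed at is the standard block-inversion formula (notation P, Q, R, S as above):

$$
\begin{pmatrix} P & Q \\ R & S \end{pmatrix}^{-1}
=
\begin{pmatrix}
P^{-1} + P^{-1} Q\, M\, R P^{-1} & -P^{-1} Q\, M \\
-M\, R P^{-1} & M
\end{pmatrix},
\qquad
M = \bigl(S - R P^{-1} Q\bigr)^{-1},
$$

where $S - R P^{-1} Q$ is the Schur complement of $P$. In our setting, reading off the blocks described above, $P$ is the Gram matrix of the old data (plus noise), $Q = R^\top$ the cross-covariance with the new point, and $S$ the new point's own variance plus noise, so $M$ is just a scalar.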
What actually happens here is that we reuse the inverse of the block matrix up here, which we might already know. Interesting, that's good, because that's exactly our setting: we already know that inverse from a previous iteration. Plus some correction term which involves how much the new data co-varies with the old, and in particular an object M, the inverse of what is called the Schur complement, named after him. And it looks like this. To make it easier to connect to our setting, I've changed the notation here to plug in the quantities we use in Gaussian process regression. That doesn't necessarily make it easier to read, but you can go back and forth after the lecture yourself to see how they relate. So let's look at the plain notation, without relation to our notation, and then go back and forth. The two important things to notice are: there is this object called the Schur complement that we have to compute, and what's its size? For us it's a scalar, so this inverse here is literally a one-over, a floating-point inversion, easy. And what else do we need? The old inverse, which we already have from a previous iteration. Everything else is just multiplications of this scalar, in the form of an outer product, with these rank-one objects Q and R. And we can write this in suggestive notation like this and say: the new inverse is the old inverse, padded with zeros, plus an outer product of two vectors with a scalar inside. Now, if we plug in the notation from the pseudocode for Cholesky decompositions, we find that it matches exactly what we've done so far, and we can see that this iteration is exactly the update for our inverse that sits inside the Cholesky decomposition. So actually Cholesky and Schur did the same thing; they just talked about it very differently, because Schur was thinking about it from a group-theoretic perspective, he wanted to compute the actions of certain group ideals on others, and Cholesky just wanted to fire his artillery piece really fast. Did Marvin tell you the artillery story about Cholesky? No? Okay, so there's a reason why the picture of this guy looks like this. He was an artillery officer in the French army during the First World War. He never wrote a paper, because he was too young; he died in the trenches in northwestern France while firing artillery at German forces. And the reason he's known to history is that when you set up an artillery piece, you have to solve a least-squares problem. You do your first shot, you see where the shell lands, you know where the Germans are, and you want to correct a little bit, so you need a correction to the settings on the artillery piece. How do you do that? You solve a least-squares problem, and because it's 1918, or actually 1915 maybe, you have to do it really quickly, because the other side is shooting at you. So you pull out your notebook, because it's 1914 and you're Cholesky, and you very quickly have to do matrix inversions on a three-by-three matrix. And it has to be really fast, because your mates are sitting next to you with the shell, waiting for the settings. So he found a way of making it twice as fast by saving some computations in the Gaussian elimination algorithm.
And he told his friends, in particular his friend Benoît: I made it twice as fast, now we can shoot twice as fast at the Germans, pretty much like that. And then he was killed by a German artillery shell. Sweet irony. His friend went back, studied math in France, and at some point wrote a report on an algorithm for the fast computation of least-squares solutions, that's roughly what it's called in French, and added in the title "as related to me by Commander Cholesky", in memory of his friend. Now, another irony of history: in the exact same year that Cholesky was born, Schur was born far away in the eastern part of Europe. And they ended up working on the exact same thing, in the two countries that were at war at that point, because Schur had gone to Germany for a while, before eventually taking refuge in Palestine and dying there almost immediately after. They worked on basically the same problem from very different mathematical perspectives. But you can think of what both of them did in terms of this algorithm, which we can now use for Gaussian process regression. It iteratively starts with an empty estimate for the matrix inverse. Then it recursively loads one row of the matrix; it does one expensive computation, which is here, well, that's the loading part: it reads one row of the matrix from memory, that's the I/O part. Then it does a bunch of O(n) operations, just inner products of vectors, and inverts a scalar. And then it updates the two things we need: C_i, the estimate for the inverse, and alpha_i, the estimate for the solution. What we've just convinced ourselves of is that what this thing does is literally read in the data set one datum after the other and update a Gaussian process posterior. We start with the prior. Here's a data set in black. We start with the very first datum, in blue; where that first datum is depends on how you order the data set in memory, I've just taken a random one. It conditions on that; now we have C_i after the first step and alpha_i after the first step, represented by the posterior mean and the posterior variance. Then a second data point, a third, a fourth, a fifth, and so on; they just get added one after the other to our training procedure. Just like when you train a deep neural network, you get one batch after the other, and batches can even be individual data points if you like, a batch of size one. That's what happens inside the Cholesky method that you've been calling so far without thinking too much about what it actually does. It literally loads the data set one datum after the other, and we just decide to let it run to the end and then say: oh, but that's cubically expensive, because it needs to run through the whole data set. So Cholesky is just a data loader. One that takes good care of the right bookkeeping, keeping track of how much you've already processed. And this bookkeeping costs us something; that's why it's more expensive. That's why each step is not O(n) but closer to O(n²), so that the overall thing is O(n³): we have to keep track of all the stuff we've loaded so far. More precisely, the cost after i steps is O(i² times n), and if i goes all the way to n, then that's again n cubed.
And this is in contrast to gradient descent, which we use for deep learning and which, as Marvin showed you, you could also use to get a point estimate for the posterior mean of a Gaussian process if you really wanted to. Gradient descent doesn't do this bookkeeping. It never looks back; it's a purely local thing. And the price we pay for that is that it never finishes, because it never keeps track of all the stuff it has touched. And now we take a quick break. I know it's a little late for a break, but you probably want one anyway. I'll continue at 9:20, and if you have questions, you can also just come to the front. Okay, so we've now constructed this algorithm, which I have cleaned up a bit to put on this slide. We keep adding a few things to this algorithm, and I want to talk about this slide for a little while, because it now has all the main steps in it. This is an adaptation of what we've constructed over the first, well, a little more than half of this lecture. It's a method that started out as a black box, this piece of Fortran code called Cholesky, right, the jax.numpy or scipy Cholesky. By the way, I encourage you, if you want to switch your brain off for a bit, to try this deep dive into where Cholesky actually ends up. You start at jax.numpy and see how far you can get down the source code to something low level. Eventually it's not going to be Python code anymore; you will reach some Fortran that keeps being called. And when you do that, it's like a time machine that takes you back. The jax.numpy version we currently use was written, I don't know, a year ago or so, it's constantly being updated, and the Fortran code you're looking at is probably from the 70s or the 80s, depending on how far down you go. I find it useful to remind ourselves of this, because the original linear algebra methods we use today stem from these packages called LINPACK and EISPACK: EISPACK is the eigenvalue solver package and LINPACK the linear algebra package, which later became BLAS and LAPACK and ATLAS and so on. These original methods were built by a bunch of PhD students in the 70s at some university in the U.S. They started out as what we would today call open source packages; today they would be on GitHub and pip-installable. Back then, of course, it didn't work that way: you had to send them a letter and they would send you the source code so you could implement it on your own machine, or something like that. But this is how numerical methods are developed. They don't magically appear; it's not like there's a company that sets out to build an open source linear algebra package. These are algorithms that someone has thought about very carefully, and yes, they are tedious, but they changed the world. This is probably one of the most widely used pieces of computer code in existence. DeepMind claims that sorting algorithms are more widely used; I don't know, it could be that linear algebra is even more widely used. So we started out with this piece of Fortran code, and we've now lifted it back into 2023 because we're doing something very specific with it: Gaussian process regression. And actually, maybe Gaussian process regression is the modern way of thinking about solving least-squares problems; it's just a bit more explicit about the fact that there is some data involved.
So this algorithm takes in our training data in a specific form. The training data is pairs of x and y, inputs and outputs, stacked together to form capital X and bold y. We do a pre-processing step where we take the outputs y and subtract the prior mean; maybe it's zero, and if not, we just subtract whatever we like. We've done that in some of the examples. Then we take all of the inputs X and call the kernel function on them; that's what we do in our Gaussian process library, that's the modeling part. So we build this kernel Gram matrix, with the noise term added, and call it capital K. What's left is the linear algebra part: modeling is over, computation starts. And what this method does is eventually return the things we need to estimate the unknown function f with uncertainty. Those things are, I'll tell you in advance: a data structure that tells us what we have read so far, we'll call it S, a bookkeeping object that records which bits have been processed, and eventually it becomes the entire data set; C_i, an estimate for the inverse of the kernel Gram matrix K, which we need to compute posterior error estimates, covariances; and alpha_i, an estimate for the solution of the linear problem K inverse times y bar, which we need to compute the point estimate, the posterior mean. And the method does the following. It starts by setting the estimates for C and alpha to empty objects; depending on how you implement it, that could be an empty array or a zero matrix, some empty data structure. Then it runs a for loop. It doesn't run forever, it's not a while-true statement; it's for i from one to the size of the data set, and then it's done. That's different from gradient descent, which is never actually done, we just stop it at some point. In each iteration, it decides that the next thing to load is the i-th unit vector, a vector of zeros with a one at position i, and it multiplies that vector with K. That's the I/O part: load this row or column of K, and remember that K is symmetric, so it doesn't matter whether we load rows or columns. Then, in line six, we construct this vector d_i, which is given by either of these two expressions. The left-hand side is what we want conceptually and the right-hand side is what we actually compute, because we already have z_i = K s_i, so we can save ourselves some time: it's s_i minus the current estimate of the inverse times z_i. Then we compute a scalar, eta_i; this scalar we've now identified as the Schur complement, the M part in Schur's notation for the inverse of an updated matrix. And then we update the two estimates, C_i and alpha_i. What I've removed here is the bit that previously gave us the lower triangular matrix in Cholesky, which we needed to solve problems afterwards. I'm just not going to return that anymore, because for Gaussian process regression maybe we don't need it once we have these two things; they are essentially equivalent representations. Numerically speaking, things are a bit different, because this object C_i is less numerically stable than the lower triangular decomposition, so if we really cared about how precise the computation is, then yes, we would do this differently.
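To make the whole loop concrete, here is a minimal, numerically naive NumPy sketch of it. This is my didactic reconstruction of the pseudocode on the slide, not the library routine; the function and argument names (iterative_gp_train, num_steps) are made up:

```python
import numpy as np

def iterative_gp_train(K, y_bar, num_steps=None):
    """Didactic sketch of the Cholesky-style sweep that maintains a running
    estimate C of K^{-1} and alpha of K^{-1} y_bar.
    K = k(X, X) + sigma^2 I (symmetric positive definite), y_bar = y - mu_X.
    Stop with num_steps < n for an early-stopped approximation; not numerically
    robust, a careful implementation keeps the triangular factor instead."""
    n = K.shape[0]
    num_steps = n if num_steps is None else num_steps
    C = np.zeros((n, n))          # running estimate of the inverse
    alpha = np.zeros(n)           # running estimate of K^{-1} y_bar
    processed = []                # bookkeeping S: which rows have been loaded
    for i in range(num_steps):
        s = np.zeros(n)
        s[i] = 1.0                # pick the i-th unit vector
        z = K @ s                 # data loading: one row/column of K, O(n)
        d = s - C @ z             # d_i = (I - C_{i-1} K) s_i
        eta = z @ d               # the Schur complement, a scalar
        C += np.outer(d, d) / eta           # rank-one update of the inverse estimate
        alpha += d * (d @ y_bar) / eta      # matching update of the solution estimate
        processed.append(i)
    return processed, C, alpha

# Prediction afterwards (prior mean zero for simplicity):
#   posterior mean at x*  :  k(x*, X) @ alpha
#   posterior covariance  :  k(x*, x*) - k(x*, X) @ C @ k(X, x*)
```

Stopping the loop after i steps leaves C and alpha supported only on the rows that have been loaded, so the resulting posterior is exactly the one you get by conditioning on those first i data points, which is the data-loader picture from before the break.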
But keep in mind that we're aiming towards deep learning, and nobody cares about numerical stability in deep learning. Weirdly, maybe they should, but we don't. So we just have to be honest: if we're trying to compete with something where everyone's a cowboy and doesn't care about anything, then maybe we can allow ourselves to do things like this. That's it. We return these three objects, in yellow and green and blue, and they give us bookkeeping, point estimate and error. And we can use them afterwards to predict the function at arbitrary points in linear time. How do we do that? We build the projection matrix K_xS: for that, we ask the kernel what the covariance is between any test point we care about and the data that has so far been processed, S. That's why we need S, because we need to tell the kernel: this is the stuff I've already looked at; the other parts of the data set haven't been processed yet, they are still vacuous. Then we get the posterior mean by taking the prior mean and adding this product, and that operation is O(n), because alpha is a vector of length n and this is a rectangular matrix of size whatever little x is by n. Actually, it's not even n; it's O(i), the number of iterations so far, because it's only at most n after n steps. And the posterior covariance is the prior covariance minus this object, which is O(i²). So, if you really want to count, the training part of this method is O(i² times n): the i² term shows up in line six, and this part here is O(n) because we just load one row of length n. So i² times n. If i equals n, that's of course n³, but if you stop early, it's just i² times n, and the prediction parts here are O(i), O(i), and O(i²), actually. So Gaussian process regression isn't cubic in n; it's just worst-case cubic in n, and we realize that because we looked at the data, eh, sorry, at the algorithm, at the lower layer. And this is the answer to the question of why we have to look at Cholesky: well, because of this. Someone actually asked during the break: okay, what does that mean for data set size, what can I actually do? Well, really big large language models and generative models are trained on data sets with a few billion training points, a.k.a. the internet. Standard Gaussian process regression in its textbook form, with the full, black-box Cholesky decomposition that we've used so far, scales, on this computer, to something like 10,000 or 20,000 data points; computing the posterior mean then takes one, two, three seconds. Beyond that it very quickly goes out of reach, because a cubic function just grows really quickly; at 100,000 data points you'd have to wait for hours. And then there are also memory issues, where you run out of memory on a machine like this, so it gets more complicated. This iterative procedure can realistically scale to data sets of around a million data points. There are contemporary Gaussian process inference libraries, for example GPyTorch, which routinely process a million data points, and they also use GPU acceleration and so on. A million data points is less than the internet, for sure.
So it doesn't address everything, but it's also not a trivial size of data set. And I encourage you, when you leave this university and go to work anywhere and someone gives you a data set, keep those sort of ballpark figures in mind, because I promise you a lot of the data sets you are going to work on will not have a million data points. In fact, you might get data sets that contain 10 data points. If you're working in medicine, for example, for a medical company, you typically have tiny data sets, because each data point is a life, and then it's not about SGD, right, and just making things converge; it's really about knowing exactly what's going on. Because the nice thing about this algorithm is it's linear algebra. We know everything about it. For example, there are books in the library by people like Ilse Ipsen who have really labored over this method to make it extremely stable. So one thing people do, for example, with Cholesky: you all know from your undergraduate days that subtracting floating point numbers from each other is not a stable operation, right? You basically lose digits in the mantissa. So what you want to do, when you run this algorithm, is to make sure that you typically only subtract numbers from each other that have roughly the same size and sort of leave the tiniest numbers to the end. So what people do is, there's a subroutine in there that you can find when you do your deep dive into the Fortran code that goes through the matrix once and checks the sizes of the entries. That's O(n squared). And then it permutes the rows and columns so that the large numbers are top left and the small numbers are bottom right. And you can do that by permuting the rows and the columns simultaneously, right? And permuting, as we've now realized, just means that you process the data set in a different order. This is called pivoting, and there's a whole literature on it in numerical linear algebra. But we're not gonna talk about it, because in deep learning no one sorts their data set, right? They just call their data loader and just hope that it works. Yes. So the question is, do we have a notion of convergence, and when do we stop this thing? No, and patently no. We have one very precise statement, which is that after n iterations, so once this for loop ends, assuming infinite-precision floating point operations, this gives the exact answer. So it's done after n steps. But there is no guarantee that if you stop it early, it's in any sense close to the true solution. In fact, there are theorems from the 1950s from Hestenes and Stiefel showing that you can construct adversarial situations where this algorithm keeps getting worse in every single step up to the (n-1)-th step, and then everything collapses and it's done after n steps. If you really want to, you can build matrices for which this thing behaves like this. So that actually gets us nicely into the final section of this lecture, but first a quick question. Yeah, so the question is, if we do this on a million data points, do we have to store a matrix that's a million by a million? No, you don't. You have to store a matrix that is a million by i. It's a rectangular thing, because you don't actually have to store C as a dense array. You store it as a collection of these d_i's and eta_i's. You could have a structure for that. You already know these structures.
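(A toy illustration of that ordering idea only; this is not the actual pivoted-Cholesky routine inside LAPACK, just the concept:)

```python
import numpy as np

# Pivoting, conceptually: inspect the matrix once, then process the data set in an
# order that puts the large (diagonal) entries first. Processing in this order is the
# same as permuting the rows and columns of K simultaneously before factorizing.
def pivot_order(K):
    return np.argsort(-np.diag(K))   # indices sorted by decreasing diagonal entry
```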
They are called Cholesky factors. They are data objects that you just hand around as a specific thing. You don't return a JAX NumPy array, you return this thing called a Cholesky output, and it stores these vectors d_i and the sequence of eta_i's. And by the way, of course, you don't necessarily store this matrix K either, because that's quadratic in the number of data points. There are libraries for large-scale data processing that operate on K functionally. Because the only point where K actually shows up in this algorithm is here. I mean, you can see it here again, but we can replace it here with this expression, so there's no K in there. The only point where K shows up is here, and it's in this matrix-vector product. At the moment, this matrix-vector product is actually just reading out an individual row. So you can build that row when you need it; you don't need to build the big matrix, because we are processing the data set one point after the other, right? In fact, we already know that we're going to only apply C_i to this z_i, and C_i is this low-rank matrix, so we don't even need the entire row, we only need the first i entries of it, and then we can stop. So if you really think about it, you can make these methods very fast. And that's sometimes actually the job of a machine learning engineer. It's not the job of a data scientist; if you want to be a data scientist, you don't have to care about this. But if you want to be a machine learning engineer, you should probably care about it, because that might be just your job. In the teams that train large language models, like 30, 50, 100 people working for Google DeepMind, Facebook, OpenAI, let's say you have a team of 30 people, I happen to know this for the OPT team at Facebook, there are like 20 of them just doing hardware to make sure that the data center works, because you keep having GPUs die on you, and they run around and hot-swap GPUs and build the data infrastructure, the loading, the I/O and so on. And then there are 10 people, well, two managers and eight people who are machine learning engineers who care about the deep neural network, and they have daily meetings to decide whether to lower or raise the learning rate. So all of these people are effectively doing the job of this algorithm: 20 of them take care of lines four and five, and eight of them take care of the remainder. So if we could do this in deep learning, if we could somehow build something like this in deep learning, it would totally change the way that deep neural networks are trained. And that's why it's useful to think about these methods while we still can, while we still deal with models where we can do this, to keep in mind what we might need to do in the next five or ten years with deep learning. So now, final step. I have written this update using this notation s_i, in this line four, which looks like a useless line so far, right? I've just renamed something, e_i to s_i. But I've done this explicitly to point out that afterwards what we do is we multiply this vector s_i onto this matrix K. So the operative part of this algorithm is this data loading operation, data load, load from disk, right? Which is multiplying with the kernel matrix K. And this is really an operation, right? K is an operator, and we apply it to this vector s.
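(One way such a factor object and a lazy row evaluation could look. The class name `CholeskyOutput` and the helper `kernel_row` are made up for illustration, assuming a `kernel` function that takes two sets of inputs:)

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class CholeskyOutput:
    """Sketch of a factor object: keep the d_i and eta_i instead of a dense C."""
    ds: List[np.ndarray] = field(default_factory=list)
    etas: List[float] = field(default_factory=list)

    def apply_inverse(self, v):
        # C @ v = sum_i d_i (d_i^T v) / eta_i, without materializing the n x n matrix
        out = np.zeros_like(v)
        for d, eta in zip(self.ds, self.etas):
            out += d * (d @ v) / eta
        return out

def kernel_row(kernel, X, i, upto=None):
    """Build row i of the gram matrix only when it is needed, and only its first
    `upto` entries, instead of storing the full n x n matrix."""
    upto = len(X) if upto is None else upto
    return kernel(X[i:i + 1], X[:upto]).ravel()
```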
And by the way, there's for example a Python library, if you want to look at it, called KeOps, for kernel operations, which deals with turning this process into a functional operation, so that when you multiply with the matrix, you never actually have to build this kernel gram matrix; you have nice functional structures that just do the multiplication directly for you. So the big insight that we should now have is: ah, we're actually computing matrix-vector products. That's the thing. The loading process is: let's load a linear projection of the matrix that defines what we do with the inputs, and of the vector y that defines what we do with the outputs, with the labels. Linear projections of the data. If you think about it, no one ever said that a batch that goes into a training process should consist of an actual sum over individual training points. It could be a weighted sum over training points. Why do I need to build my batch as a sum over loss functions on the i-th training point? Why not put in a little weight here, or actually s_i, that would allow me to focus the attention of the training process on some data points that might be more important than others? Hmm, what does that mean? So Cholesky is literally just sequentially loading individual data points. But we could also load, for example, multiple data points in one step; then it would be block Cholesky. That is actually an algorithm, called block Cholesky. But we could also load projections of the data, right? So we could actually have something called a data loader that is worthy of that name, that really goes through the data set and somehow smartly decides which of the data to load and how to weigh them relative to each other. So what I've changed in this version of the algorithm is that line four no longer says s_i is equal to e_i; instead, s_i is whatever takes care of data loading. An I/O subroutine, something that works on the part of the data center where all the disks are, or the solid state drives, the storage unit, right, that tells us what to load when. If you're training your large language model, should you first train on Wikipedia or first train on Twitter? In which order, right? Or maybe a mix of the two, so that the thing learns the whole thing. So what would be, from the linear algebra perspective, the best directions to choose to solve the problem really fast? We're gonna multiply the matrix with it. So what's your knee-jerk reaction to: okay, I need to approximate a matrix K really quickly, what's the best basis to write K in? Principal components, yeah? So principal components is already the stats term for it. What are we computing in the linear algebra sense? Eigenvectors, yes, exactly, very good. So if we could magically choose the projections to be the eigenvectors of the matrix, then every individual observation in line five or so of our algorithm, the projection K times s, would just be returning an eigenvalue. It would just tell us the eigenvalues of that matrix. And then our update to the inverse would trivially just be the eigenvector itself, because all of the previous projections we've chosen would be orthogonal to the new one. So we wouldn't even have to compute this line, we could just forget about it; this line would just go away, it wouldn't cost anything anymore. And this Schur complement would actually just be a sequence of one over the eigenvalues.
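(To make this hypothetical best case concrete, a tiny numerical check, under the obviously unrealistic assumption that we already know the eigendecomposition:)

```python
import numpy as np

# Hypothetical best case: if the data loader could hand us eigenvectors of K,
# each observation K @ u_j just returns lambda_j * u_j, and the estimate of the
# inverse becomes a plain sum of u_j u_j^T / lambda_j, one term per iteration.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
K = A @ A.T + 6 * np.eye(6)                 # toy symmetric positive definite matrix

lam, U = np.linalg.eigh(K)
order = np.argsort(-lam)                    # sort eigenvalues, largest first (as in the lecture)
lam, U = lam[order], U[:, order]

K_inv = np.linalg.inv(K)
C = np.zeros_like(K)
for i in range(6):
    C += np.outer(U[:, i], U[:, i]) / lam[i]      # one "free" update per eigenvector
    print(i + 1, np.linalg.norm(C - K_inv))       # error drops to ~0 once all n terms are in
```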
And our approximation to the matrix inverse would just be a sum over the eigenvectors times one over the eigenvalue, which is sort of the PCA decomposition of that matrix if you like, or actually a better way to think about it: it's the best rank-i approximation in Frobenius norm of this matrix. And so if we manage to sort the eigenvalues by size, start with the largest one and end with the smallest one, then this is the fastest, in L2 norm, that any iterative algorithm could converge to an approximation of the inverse of the matrix. It would rapidly expand exactly the directions that we need. The only problem with this is that you don't know what the eigenvectors are, right? If you did, you wouldn't need to do any of this; you would just project onto them, divide by the eigenvalues, done. Question. Yes, good point. So far I've argued that this algorithm, Cholesky, is O(i squared times n). Now this would become O(n squared times i), because we have to do this first line, K times s, which is now a product between a matrix of size n by n and a vector of size n. So that's O(n squared). Each of these iterations becomes O(n squared), and we do i of them. So after n iterations, it's again n cubed. Fine. But we also gain something, because this will converge much faster. We might be able to stop after two or three steps rather than after a hundred or a thousand or ten thousand, and then maybe that's worth it, depending on what the eigenvalue spectrum of our matrix is and how our data are distributed relative to it. It's just expanding the matrix in a different basis, and each of these expansions could buy us more. But before we get too deep into that, there's still a problem, which is that we don't know what the eigenvalue decomposition is. So we can't do this. But thankfully we're doing linear algebra, and it's not some fancy new technology that was invented, I don't know, ten years ago and has been rapidly developing; it's something that has been done by humans since Laplace or Gauss and so on, and very smart people have thought for a very long time about how to do these kinds of operations extremely efficiently. And one of the coolest among all of them is this guy, Cornelius Lanczos, a Hungarian mathematician who had an absolutely crazy career. He was the personal assistant to Albert Einstein. He had an offer for a professorship in Frankfurt, but unfortunately he was Jewish and it was right after the Nazis had taken over power in Germany, so he decided not to stay in Germany. He left for the US, stayed there for quite a while, but then the US became a bit difficult for people like him, because McCarthy thought that he was a communist. So he had to go back to Europe, but he didn't want to go to Germany. He went to Dublin, where Erwin Schrödinger created a professorship for him. And he stayed in Dublin until he died in 1974. He invented some of the coolest stuff in linear algebra. He is the one who could best claim to have invented the singular value decomposition; he figured out that any matrix can be represented as the product of two orthogonal matrices and a non-negative diagonal matrix. And he also invented this algorithm, which is called the Lanczos algorithm or the Lanczos iteration, which is a beautiful way of constructing almost an eigenvalue decomposition.
And we don't have enough time to go through it and figure out exactly what it does, but I'm gonna explain to you conceptually what this method does, and these lines are actually what it does. So it's a way of constructing an almost-diagonal decomposition, an almost-diagonalization of K. What we're going to try and find is a sequence of vectors s, also in our notation s, that turn K into a tridiagonal matrix: zeros everywhere except for the diagonal and the first off-diagonals. Why is this cool? Well, because this is essentially an O(n) object, and once you have it, working with it is trivial. You won't immediately see how; you can think about it on your way home if you like, but you can imagine that there is something akin to back substitution that allows you to solve with such matrices. This is called the Thomas algorithm, and it doesn't matter here, but you can imagine that the matrix is already an O(n) object, so it's roughly O(n) to work with it. So how does this algorithm work? On an algorithmic level, it starts out with an initial guess for s; you typically choose a random direction. This algorithm, by the way, is implemented in scipy.sparse.linalg.eigsh, the sparse symmetric/Hermitian eigensolver; it does essentially this. You start with a random vector that you draw at random, that's s_0, and then you repeatedly evaluate K times s, that's line two and also line eight, and then do something very smart to make sure that you get a new direction, which removes the action of all the previous directions except for the immediate predecessor, such that the remainder is pointing somewhere new, orthogonal to all the previous directions. And it turns out, interestingly, that one can do this without looking at all of the previous directions, only the current one and the last one; that's this final line 10. Another way of thinking about this algorithm is that it's essentially Gram-Schmidt orthogonalization starting with s_0. If you've done Gram-Schmidt before in your undergraduate studies, you might remember this process of creating orthogonal bases, which you might even have done in high school. It's essentially this, but very smartly written down, realizing that some things don't have to be computed because you already know that the previous directions are all orthogonal to each other. So there's some kind of smart recursive thing going on. And one thing that I should point out: annoyingly, you shouldn't actually implement this method in this way. That's just the textbook form. If you do, it won't work, because this is an extremely numerically unstable operation; it just goes haywire after two or three iterations. So when you actually implement this, people have to do a lot of smart tricks, and that's why we don't look at the Fortran code. But in principle this is what it does. And there is another fun observation that people already made in the 1950s, Hestenes and Stiefel, who said: oh, so there's this special vector s_0 up there, which is this initialization of the problem, and Lanczos says, oh, just take whatever, take a random direction. But actually we could think about what the best direction would be. And we're doing Gaussian process regression. We want to minimize our residual; we want to minimize the distance between our current estimate of the posterior mean and the actual posterior mean, right?
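(Here is what the textbook three-term recurrence looks like as a sketch; as said above, this plain form loses orthogonality in floating point after a few steps, so real implementations, like the ARPACK code behind scipy.sparse.linalg.eigsh, add stabilization on top. The function name and interface are my own:)

```python
import numpy as np

def lanczos(matvec, n, num_steps, rng=None):
    """Textbook Lanczos recurrence (sketch): builds directions S and the entries of a
    tridiagonal matrix T (diagonal `alphas`, off-diagonals `betas[:-1]`) with S^T K S ~= T."""
    rng = np.random.default_rng() if rng is None else rng
    s = rng.standard_normal(n)
    s /= np.linalg.norm(s)                     # random start direction s_0
    s_prev, beta = np.zeros(n), 0.0
    alphas, betas, S = [], [], []
    for _ in range(num_steps):
        S.append(s)
        z = matvec(s)                          # the only place K appears: K @ s
        a = s @ z
        z = z - a * s - beta * s_prev          # subtract current and previous direction only
        alphas.append(a)
        beta = np.linalg.norm(z)
        betas.append(beta)
        if beta < 1e-12:                       # found an invariant subspace, stop early
            break
        s_prev, s = s, z / beta
    return np.array(alphas), np.array(betas), np.column_stack(S)
```

Calling this with, say, `matvec=lambda v: K @ v` and then eigendecomposing the small tridiagonal T gives approximate eigenpairs of K.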
So we could guess that a good first direction would be the residual: the distance between the data that we have observed and our current estimate for the posterior mean. If you do that, then this algorithm is called conjugate gradients. That's it. That's the entire thing. If you've heard of conjugate gradients in an optimization lecture, that's it. And we can implement it in a reasonably stable fashion, and then it'll look like this. So we start with the prior. Then, in the very first step of this method, it goes through the data set once, multiplies with the kernel gram matrix, computes the residual, those are these red lines, they are standardized to length one, and then it updates the entire posterior mean in one step, like this. And this is our blue estimate. In the background, in black, you see what we're aiming for: that's the ground truth after all data points. And you see that this is much better than the first step of Cholesky. It's already very close to the truth. You could even stop here, and it would be a decent estimate, right? It's not perfect, but it has a lot of the structure already. And then in the next step, we get very close, right? With two steps of conjugate gradients, you're basically done. And now the method just keeps fiddling with the tiny bits; it does some refinements very, very carefully, all the way to the end. You see that on the left things are now pretty good; over here, we're not good yet, and maybe this is mostly good, here a bit is missing. So we see a correction here; next iteration, the final bits get corrected. Now we're almost there; over here it's a bit bad still, so that's gonna happen in one of the few final iterations. Bam. And we're done. So wouldn't it be nice if this is how deep learning worked? Like a normal computer, like we're used to how computers work: you just press the button, you walk away, you come back, it has done its job. That's not how deep learning works these days. But Gaussian process regression can work this way. And I want you to appreciate this as we move in the course towards these more hacked-together models that everyone is currently using in industry. So instead of loading individual data points, we can also load linear projections of the data. We pay a bit of a price for that, because we need a matrix-vector product in there rather than just loading a row of a matrix. But we also gain something, which is very rapid convergence. And by the way, for this algorithm there are beautiful convergence results that show that it converges very fast, as a function of the eigenvalue spectrum of the matrix. You can imagine that if the eigenvalues are all spread out and of similar importance, with no dominating eigendirections, then this thing is not going to help you much; it'll still have to work through one direction after the other. But in many settings, the typical setting in machine learning, you have problems with some dominating eigendirections, which basically describe the entire thing quite well, and then there are tiny bits of corrections afterwards. And conjugate gradients is an iterative procedure that makes the biggest corrections first and then does the other ones later. So by constructing such directions, we can actually converge very fast. I will leave out this bit and do it next Monday, and let me just show you this final algorithm, which is our adapted version of iterative linear algebra, specifically for Gaussian process regression: an algorithm that starts with an initial guess.
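(A minimal sketch of that loop, written against a matrix-vector product so the gram matrix never has to be stored. The function names are mine; in practice you might call a library routine such as scipy.sparse.linalg.cg instead:)

```python
import numpy as np

def cg_gp_weights(matvec, y_bar, num_steps, tol=1e-10):
    """Sketch: conjugate gradients for the GP mean weights, solving
    (K + noise * I) alpha = y_bar given only a matvec for the left-hand side."""
    alpha = np.zeros_like(y_bar)
    r = y_bar - matvec(alpha)            # residual: data minus current fit
    s = r.copy()                         # first direction = the residual itself
    rr = r @ r
    for _ in range(num_steps):
        z = matvec(s)                    # one pass over the data: K @ s
        step = rr / (s @ z)
        alpha += step * s                # update the weight estimate
        r -= step * z                    # update the residual
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:        # converged up to floating point
            break
        s = r + (rr_new / rr) * s        # next direction, conjugate to the previous ones
        rr = rr_new
    return alpha
```

The posterior mean at a test point is then, as before, the prior mean plus the cross-covariance times alpha.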
And by the way, you can hand that initial guess to the function. It could even be something other than zero; it could just be an optional argument. Then this would be called a preconditioned version of this algorithm. And it produces, iteratively, all the things that we need to do Gaussian process regression. It's just a rewrite of your low-level numerical linear algebra layer, specifically optimized to do the things that we actually need. So in least-squares regression, in Gaussian process regression, learning, training the model, is linear algebra. If you want to understand how expensive it is and what its complexity is, you need to look at the linear algebra. We've realized that the numerical methods that do this are just data loading and bookkeeping algorithms. They effectively load the data set in a smart way, either in a numerically stable way or in a maximally informative way in the case of conjugate gradients slash Lanczos iterations. For Gaussian processes, there is no real separation between computation and learning. The data is a matrix and a vector, and operating on that matrix and vector is learning. So the computation and the learning are the exact same thing. In deep learning, the separation will be a little bit more tricky to figure out, but it's just a generalization of this process. So I hope that, despite the dryness of this topic, you understand that it's very important if you want to make headway in a field that is rapidly developing. Please provide some feedback. On Monday, I'll very briefly do the final two slides of this, and then we will talk about what we do if the data is not real-valued observations of a function. Thank you.