Let's get back to where we were last Thursday. We ended this first part of the course on the foundations of probability and tractable inference with exponential families, and we focused on the bread-and-butter machine learning application that you might think of as the typical AI setting: supervised machine learning. Can we learn functions with this Bayesian formalism? We were interested in functions that map from some input space, initially just to the real line; we'll get to other domains later. Notice that the input space, interestingly, is already very abstract — it's just x, because it could actually be anything. Why? Because the first thought we had about this function was: how can we represent a function? We started with a linear function on a simple data set, then a non-linear function, and we realized that maybe we can write functions like this, where we implement some feature function phi of x that takes the input x, does something with it, and just returns a bunch of real numbers. One nice side effect of this that I didn't discuss last Thursday is that we don't really have to care about what x is. It could come from any kind of weird space. In the plots that I showed you it was of course the real line, because it's easy to do plots in 1D, but it could really be anything, because there's this function phi that wraps around it. And since everything afterwards only happens on the output space of phi, x could be multi-dimensional inputs, it could be images, strings, proteins, words from a text, and so on. Once we've applied phi to it, we have a bunch of real numbers, and then we're in business. And we decided that we're going to write these functions in what is called a parametric linear form. Linear because this function is now a linear function of the stuff that we care about — the w, the weights, the things that we claim not to know and that we'd like to learn. And parametric because there is a finite collection of w's, of weights, that describe the function. They also span a space of functions. And then we discovered that this is enough, actually, to apply the algebra of Gaussian inference. So here's our first heavy math slide. Let's stare at these expressions for a bit and figure out again what was happening. This is, by the way, all the stuff that we managed to encapsulate into our nice Gaussian abstract data class so that we didn't have to think about it too much, but it's sometimes useful to look at the math anyway, because you can learn something from it. We decided to learn this function f, which we wrote in this parametric linear form, and then did Bayesian inference by basically using the properties of exponential families and the fact that the Gaussian is the conjugate prior of the Gaussian with unknown mean and known covariance. So we assume that our data comes from this likelihood model, this generative model: we observe pairs of x and y, where y is supposed to be an evaluation of the function at x, corrupted by Gaussian noise. That's the scientific lingo: the noise, or the likelihood, is Gaussian. In particular, it's Gaussian with covariance sigma times the identity, which makes things simple. Actually, everything I said would also work if the covariance is more general, as long as the likelihood is still Gaussian, because you can basically see that this sigma times the identity just shows up in a bunch of places here.
If you replaced this with another matrix, you would just have to replace the sigma times the identity everywhere. We haven't made use of this so much yet. This is our likelihood. The corresponding conjugate prior, if f can be written in this linear form in phi of x, is this Gaussian over the weights: a Gaussian distribution with a general mean and a general covariance. Then we can compute a conditional distribution for w given y, also known as the posterior over the weights given the data. For that we have to look up this nasty piece of linear algebra, but it is just linear algebra. The expression looks like this. That's a Gaussian posterior distribution which has a mean — that's a vector — and a covariance — that's a matrix, a symmetric positive definite matrix. They have this particular expression which involves all the terms of the prior and the likelihood in, let's say, linear algebra form. Not in linear form, because there are some inverses in there, but it's linear algebra: we build matrices, we solve linear systems of equations. We could rewrite this with new parameters and say that this posterior over the weights is also a Gaussian — that's conjugate prior inference. The posterior looks like the prior, with a new mean and a new covariance. Let's look at this posterior mean for a bit, and this posterior covariance. The new posterior mean can actually be interpreted quite intuitively. We started out with some point estimate, mu, for the weights, some initialization, and after having seen the data, the updated estimate is the old estimate, this thing, shifted by some term. What is that term? Well, it says: tell me what the data is, y, and subtract from it what we would have predicted y to be under the prior. The prediction for y — because y is an observation of f, and f is a linear map of the weights — is obtained by just applying the linear map: the prediction for y is phi of capital X, the training points, times mu, the initial prior mean. And I subtract those two from each other; that's called a residual. So: how far is what I observed from what I thought it would be? And now I have to ask, well, how far did I expect them to be from each other? Maybe I was very uncertain and I expected them to be very far apart. Maybe I claimed to be very confident, and then they should be very close to each other. So the scale on which I am wrong is given by this matrix, yes? You mean this and this? So this thing. You can think of this as a signal-to-noise ratio — well, a signal-to-noise ratio would be a ratio between a mean and a standard deviation; here we actually have an observation, so signal-to-noise ratio is maybe not the right word, it's more a scaled residual, a standardized observation. Ah, so the left-hand side, these two. You can think of these two as a signal-to-noise ratio. The word that is usually used in this community is gain, which is a bit like a signal-to-noise ratio. And we'll get to the signal processing view on this in a bit — in signal processing, this is called the gain. It's how much you should update your estimate, and then you multiply it with the observation. So if this number is large, then we're going to respond strongly to the data, right? And if it's small, we won't. But another way to think about it is from the other direction: here is how far the data is from what I thought it would be, scaled by how far I thought it would be from what I predicted.
And so this is like a scaled observation, and then I multiply by the covariance between the thing I care about, the weights, and the thing that I get to see, y. So between the latents and the observed. And yeah, maybe that's a good interpretation. For the covariance, the most important thing — also in light of the question that one of you asked last Thursday — is to notice that this expression looks a lot like the one up here. And we'll get to that in a moment. Before we get to that, though, I want to also point out that once we have a posterior distribution on the weights with these parameters, we naturally also immediately have an associated prediction, a marginal prediction, over the function. And how did we get that? Well, the function is a linear map of the weights. So necessarily, or automatically, directly, the prediction for the function is also a Gaussian distribution over the values of the function at any location little x, which has a posterior mean and covariance which are given by taking the posterior mean and covariance of the weights and applying, in the case of the mean, the linear map from the left, and in the case of the covariance, the linear map from the left and the right. Oh, and by the way, one confusing thing that I'll keep playing with today is the notation, because it's at the same time confusing and also very useful. I'm going to use these subscripts for arrays because they make the equations way shorter. So you could think of this here: that's phi, the function phi, evaluated at little x, any point x. And just like a Python function, when you hand this thing an array, it returns an array. So if x is a list of 300 input points, then this will be of size 300 by whatever number of features I have. So if I have 20 features, it's 300 by 20. And here, this is a capital X, that's the training points, the locations that we've observed. And I realize that little x and capital X are difficult to tell apart visually. After this slide, I will try to replace the little x's with just a big dot, a bullet, to say it's just a function that is open, waiting for input. OK. So the most important thing to notice from these two equations up here is that this expression for the posterior covariance contains almost completely the stuff that we have in the mean. In fact, we can think about this a little bit. One maybe very explicit way of conveying how easy it is to compute the posterior covariance is this: if you look at the posterior mean again from the previous slide, then one way to talk about it is to say, well, this is a linear function of y — y is just here at the back, it just gets multiplied with a matrix. So one thing I could do is compute the sensitivity of the model to y, the derivative of the mean with respect to the y's. That's a Jacobian. It's a matrix that contains all the d-prediction-at-x with respect to d-observation-at-y. That Jacobian is actually this green thing in here — the thing you just called a signal-to-noise ratio, but which is actually more like a gain. Now that gives us a name for this thing. And we can clearly write the posterior covariance with it, because it shows up in here as well. So what does this mean? Well, first of all, it means that we can clearly compute this quantity, because it's the derivative of our posterior mean. If we can compute the posterior mean, we can compute this derivative with respect to the data.
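To make the algebra above concrete, here is a minimal NumPy sketch of the weight-space posterior, of the gain as the "scaled-residual multiplier", and of the gain also being the Jacobian of the posterior mean with respect to y. The function name, the toy features, and the noise level are my own choices, not the lecture's actual code:

```python
import numpy as np

def weight_posterior(PhiX, y, mu, Sigma, sigma_noise):
    """Posterior over weights for y = PhiX @ w + noise, with prior w ~ N(mu, Sigma)."""
    # Covariance of the prior prediction for y: Phi Sigma Phi^T + sigma^2 I.
    G = PhiX @ Sigma @ PhiX.T + sigma_noise**2 * np.eye(len(y))
    # The gain: how strongly the estimate responds to the data.
    # It is also the Jacobian d(posterior mean)/d(y).
    gain = Sigma @ PhiX.T @ np.linalg.inv(G)
    residual = y - PhiX @ mu                     # observed minus predicted under the prior
    mu_post = mu + gain @ residual               # shift the prior mean by the scaled residual
    Sigma_post = Sigma - gain @ (PhiX @ Sigma)   # prior capacity minus what the data pinned down
    return mu_post, Sigma_post, gain

# Toy usage: bias + linear feature, five noisy observations.
rng = np.random.default_rng(0)
X = np.linspace(-2.0, 2.0, 5)[:, None]
PhiX = np.hstack([np.ones_like(X), X])
y = PhiX @ np.array([0.5, -1.0]) + 0.1 * rng.standard_normal(5)
mu_post, Sigma_post, gain = weight_posterior(PhiX, y, np.zeros(2), np.eye(2), 0.1)

# Predictive distribution over function values: apply the linear map phi
# from the left (mean) and from the left and right (covariance).
Phi_star = np.hstack([np.ones((3, 1)), np.linspace(-1.0, 1.0, 3)[:, None]])
f_mean, f_cov = Phi_star @ mu_post, Phi_star @ Sigma_post @ Phi_star.T
# Note: Sigma_post (and hence f_cov) depends only on where we observed, not on y.
```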
This used to be a big observation, but in your generation you're all aware that autodiff is a thing, so we can compute derivatives of functions at low computational cost. So if we can compute the mean, we can compute this object, this Jacobian, at a cost that is linear in the number of observations. And therefore, we can compute the posterior covariance at a cost that is a linear overhead over computing the mean. Now, computing the posterior mean is actually cubic in the number of data points; we observed that a few times already, because we need to solve a linear system with this matrix. But once you have it, the extra overhead of being uncertain is linear. So that's the answer, maybe the first technical answer, to the question that one of you posed last Thursday: why be uncertain at all — it seems like a bad trade-off if it's so complicated to do. No, it's not complicated to do. We just spent a lot of time thinking about it, but it's not computationally hard. If you can build a point estimate, you can build an error estimate. Why? You just take the gradient of the mean, and then you do something with it that amounts to applying it to a matrix. What actually is it? Well, we start with the prior covariance, and then we subtract this covariance times this Jacobian. So what this representation down here says is that the posterior covariance is the sensitivity of the model. It tells us how much of the overall capacity of the model, sigma, has already been extracted from the data. Once it's zero, we're done — we can't learn anything anymore; the model has become stiff, frozen. And while we're learning, we keep extracting information from the data, and that reduces the uncertainty. And the posterior mean, then, is our point estimate, which arises from the data. And if the posterior mean is still flexible at some point — if you could add a new data point and it would change the mean a lot — then you see this exactly in this expression. OK, by the way, am I getting too loud again? Doing math, so it's getting louder? No? OK. So the other observation to make — we're just trying to make the connection both to statistical machine learning and to the deep learning class — is that in statistical machine learning, you did nonlinear parametric regression early on, and it maybe just flew by: oh, whatever, it's a simple thing, let's do the complicated stuff. We'll spend a bit more time on it, because it allows us to really think about uncertainty in a precise form. We saw that we can be uncertain for free in this particular setting, where "for free" for me means at a complexity that is of lower polynomial order than the cost of computing the mean. So once you have the model, the point estimate, you can also construct the uncertainty. And that is really the viewpoint that I want to get across over the entire semester: constructing point estimates without error bars is just a bad thing to do. And most of deep learning is like that. You just return a point estimate. GPT just tells you what it thinks, no qualifications whatsoever about whether it's good or bad. And what I'm trying to get across is that that's not necessary, because you can compute the error in a tractable fashion — at least, so far, in these simple models. We haven't actually done deep learning yet, but so far we've convinced ourselves that this is not expensive. So maybe, by the way, let's think a little bit about the connection to deep learning.
So I already teased this a few times, and we will keep talking about it for the entire term, but let's do it once for this particular setting very clearly. What we did last Thursday is this Bayesian inference on functions. Here it is up here again. And we already sort of saw that we can draw a picture like this, which is already very suggestive that there might be some neural network going on here. We take our input x, we construct some features. In the initial example, there were just two features, right? A bias and a linear term. And then I said, well, we can think of some other features being computed. And then we played around last Thursday with sigmoids and steps and ReLUs and all these other features that we came up with. And actually someone also wrote in the feedback: that seems really complicated, that I have to choose between so many different features. That's true, it's kind of annoying, but it's also useful to first realize for yourself that you are not constrained to using ReLUs or tanhs or Gaussians or whatever. You can pick any feature you like. Really any. In particular, a trained deep neural network could also come in here from below. And then we add these weights on top; that's the last layer of our network. And now, how do we actually train this network? Okay, let's think about what we've done. We've constructed prior times likelihood divided by the evidence. That's Bayesian inference, right? Bayes' theorem: prior times likelihood divided by evidence is the posterior. And now one way to think about this is the one that we've already used for exponential families: let's find the mode of this expression. Because it's a Gaussian distribution, right? And for Gaussians, the mode is equal to the mean. So let's find that mode. We take the logarithm of this expression. Why do we take logarithms? Because logarithms are monotonic functions, and therefore they don't shift the location of the mode, but they make the computation much easier, because all the exponentials in the Gaussians go away. That means we get rid of this normalization constant. Nice. We get rid of the exponentials in the Gaussian. And we're left with: the log posterior is equal to the log likelihood plus the log prior. But typically in deep learning, people talk about minimizing something, and here we would maximize something. So let's put a minus in front so that we can talk about minimizing, right? Because maximizing something is like minimizing minus the same thing. So we write a minus in front. And now we just write out what the likelihood and the prior actually are. Gaussians are e to the minus a quadratic form; we have a minus in front, so the minuses cancel. And we're left with: the negative log likelihood — well, that's what this product form of the likelihood means; the product of exponentials becomes, under the logarithm, the sum of the terms in the exponents — is the sum over the quadratic distances between y and phi times w. So for those of you who have had a deep learning class, this is called the L2 loss, least-squares loss, or square loss. And then we have this prior, which is this quadratic form: w minus mu transpose, sigma inverse, w minus mu. By the way, this is not on the slide, but what would happen if I set mu to 0 and sigma to the unit matrix? Because that's something people like to do anyway, right? That's the standard prior. What would that be? L2 regularization.
So if mu is 0 and sigma inverse is the identity, then we just have w transpose w, otherwise known as this. So what we have back here, in the special case of a prior with mean 0 and covariance 1, is L2 regularization for a neural network. It's a weight cost. You're just saying: I don't want the weights to be large, let's regularize the weights towards 0. And depending on which community you come from, either you're disappointed by this — why did I have to go through this big Bayesian machine learning class if I end up just doing the same thing that I've already learned to do in deep learning? — or, if you come from a Bayesian perspective, you're like: why do people in deep learning do things so restrictively? Why do this if you could do that? You could have a prior. Maybe I know something about my weights and I would like to put that in. Wouldn't it be nice to have a prior for a neural network that actually contains some useful information? Turns out that's something people actually do. For example, they build stuff like geometric neural networks, which basically amounts to doing away with this and replacing it with something more complicated. So are we actually doing deep learning? The model so far looks like we're doing deep learning. Are we actually computing something we would get in deep learning as well? Let's say we train this deep neural network, which is a loss function over this architecture with these weights, with this data. Let's compute the gradient, because then we could do SGD, Adam, whatever. The gradient of this expression — you can work it out for yourself — is this. I'm not going to do the derivation for you, but it's just linear algebra, so you can do it on a piece of paper. The trick is, well, after a while you might see it, but if you don't see it immediately, you can write all of these things as sums over individual elements, then take derivatives with respect to elements, and you can convince yourself that that's what the gradient looks like. Now we expand these brackets and we get this expression. And this is a function that is linear in the weights — our gradient — with this term in front, and then linear in the data, and a bunch of constants. If you want to set that to zero to find the mode, we can solve for w. And we find an expression that, if you have stared for long enough at the previous slides, you might recognize as — what? It's the posterior mean. It's just the other form. Actually, in a previous lecture I had the other way of writing this matrix, using the Schur complement. This trained neural network is the posterior mean. We just found it in closed form. Ooh! No Adam. Just closed form. Directly. And you saw me do that last Thursday, and it took milliseconds. There was no call to Adam, no long training run with a learning-rate decay and then updates and resampling and five epochs. We just went through the data once — boom, estimate. And what's the Hessian of this loss? Well, it's this. If you take the derivative of this thing — sorry, of the loss — with respect to w again, that is like taking the derivative of this expression with respect to w, and that really just gets rid of the w. So this is our Hessian of the loss. And that Hessian is the inverse covariance. It's the precision of the Gaussian. So let's read the sentence like this. What we did last Thursday —
Gaussian parametric linear regression means training a single-layer neural network with quadratic loss — like just the last layer of a deep neural network. Or: deep learning — no, shallow learning — with quadratic loss is equal to Gaussian parametric regression, where the Gaussian form really gave us this L2 loss. The Gaussian assumption, the likelihood, is like fixing the loss. So either you think that's boring, because it means you could have just done deep learning and you already maybe know how to do that, or you think that's cool, because now you know why you're doing deep learning this way: you're actually defining a likelihood and a prior over the network. You just never realized that there was this prior and likelihood, but they were always there. You just set them to something. Yeah. So the question is: would a deep neural network actually arrive at this result if I trained it with Adam? No. But that's not our problem. That's the problem of Andreas Geiger in the deep learning class; he has to think about how to make sure that this works. Once you train with an optimizer, things get approximate, and if you're lucky, you end up at the right point. Now, you could actually use a classic first-order optimizer that is not stochastic — you could use gradient descent — and then, asymptotically, this algorithm would actually converge to exactly that point. Just really, really inefficiently. Because we know how to do it with linear algebra, and linear algebra is so much more efficient. So what this is, is least-squares estimation. It's the good old statistical method of least squares. And here's the connection to statistical machine learning again. If you thought least squares is easy and sort of done — it's from the 1700s or something, Gauss and Laplace and so on — then why should we still keep doing it? Because everyone's doing it: we're already doing deep learning. So maybe we should think about what that means in terms of uncertainty. And given that we just saw, a few slides ago, that computing the uncertainty isn't so complicated — it only involves computing basically a Jacobian of the mean, and we know how to compute Jacobians of a loss function — maybe we can also be uncertain for deep neural networks. We'll get to it for deep nets. So far, it's just shallow nets. And I will keep confusing shallow and deep, because, of course, shallow networks aren't really a thing; a shallow network is just a deep network with one layer. So at this point, we've also given interpretations for both the mean and the covariance. The mean is the least-squares estimate. The covariance is the Hessian — actually the inverse Hessian — of the loss function at the mode. By the way, is it the Hessian of the loss function at the mode? Really, no. It's just the Hessian of the loss function. Why? Because the loss is quadratic, so its Hessian is constant. We can evaluate it wherever we like, even at the prior. As long as we put in all the data, it doesn't matter at which point w we evaluate the Hessian. And this, by the way, gets back to the question of why the uncertainty doesn't react to y: because there's just no y in here. You had a question. Oh, so the question is: what if I want to change the loss function? Then this very efficient algorithm of finding the mode won't work anymore, and we will need to use some other algorithm.
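To make the earlier point concrete — that the closed-form least-squares solution is the posterior mean, and that the Hessian of the quadratic loss is the posterior precision — here is a small numerical sketch with a toy setup of my own (not the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N, F = 8, 3
PhiX = rng.standard_normal((N, F))              # features evaluated at the training inputs
y = rng.standard_normal(N)                      # observations
mu, Sigma, s2 = np.zeros(F), np.eye(F), 0.1**2  # prior mean/covariance and noise variance

# Loss(w) = ||y - PhiX w||^2 / (2 s2) + (w - mu)^T Sigma^{-1} (w - mu) / 2.
# Setting its gradient to zero gives a linear system; its Hessian is constant.
H = PhiX.T @ PhiX / s2 + np.linalg.inv(Sigma)                      # Hessian = posterior precision
w_map = np.linalg.solve(H, PhiX.T @ y / s2 + np.linalg.inv(Sigma) @ mu)

# The posterior mean and covariance in the form from the earlier slide.
G = PhiX @ Sigma @ PhiX.T + s2 * np.eye(N)
w_bayes = mu + Sigma @ PhiX.T @ np.linalg.solve(G, y - PhiX @ mu)
Sigma_post = Sigma - Sigma @ PhiX.T @ np.linalg.solve(G, PhiX @ Sigma)

assert np.allclose(w_map, w_bayes)                # closed-form "trained network" == posterior mean
assert np.allclose(np.linalg.inv(H), Sigma_post)  # inverse Hessian of the loss == posterior covariance
```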
And then, in fact, we'll eventually get back to first-order optimizers at some point, once things get really complicated. The other way to think about this is: if someone uses a different loss function in deep learning, that probably means they use a different likelihood. What kind of likelihood do they actually use? You already did a homework exercise discovering that the logarithm of the Dirichlet distribution has something to do with the cross-entropy loss. OK, so let's maybe keep thinking about this. Any more questions? Yeah? OK, so the question is about my claim that I can replace cubic with linear. Let's be very careful. What I said is: if you have computed the mean — which is cubically expensive to do if you use pedestrian linear algebra, if you just call scipy.linalg.cholesky — then, once you have done that, once you have this mu as a function of y, you can take the function that implements it and compute its Jacobian. And computing the derivative of a function has constant-factor overhead over the cost of evaluating that function. So the overhead on top of it will be linear in the data. What I'm saying is: being uncertain doesn't make the problem harder. The problem itself is still hard — well, polynomially hard — but we are not going to add anything that makes it significantly more expensive by being uncertain. Yes. So the question is: aren't we already computing it? Right — we don't even have to call autodiff again; we're just observing that it's like using autodiff. Maybe a final question on this. OK, so I said you could use autodiff. And yes, of course — you actually shouldn't. You shouldn't compute the posterior covariance with autodiff, because your typical autodiff framework will not understand all the efficiency that could be leveraged from this algebraic structure. You should actually do what we have in the code examples. But the observation I'm trying to make here is that computing the error estimate — being uncertain — could be constructed using autodiff. And because we know that autodiff is potentially efficient — well, not just potentially, it's fundamentally efficient in terms of its complexity overhead over computing the function itself — being uncertain can't be expensive. And this is important, because people keep making the assumption that point estimates are easy and uncertainty is hard. So now we're getting to the computational question of how expensive all of this is. How expensive is it to get the point estimate? Could we do it with gradient descent? Could we do it with Adam? We'll talk about this in a later lecture. What we've done so far is just linear algebra. It's useful to think about this from a linear algebra perspective, because linear algebra is actually super efficient. It has been studied for the last 100 years, and it's basically the most efficient class of algorithms we have available on computers. So the question is: what would regularization look like if you use a general covariance matrix? Well, it would look like this. And to get an intuition for what it actually is and how you would use it, let's wait for a bit. I'll talk about this when we get to that part — there will be about four lectures on what this all means for deep learning later in June, late June probably — and then we will have concrete examples of why you might want that. Maybe as a sneak peek, one reason why you might want this is that you know something about the structure of your problem.
You know that you're learning to solve a differential equation, for example, and you would like to make sure that your model knows that there is a differential equation that constrains the weights to each other. That's called geometric deep learning, or neural ODEs, physics-informed neural networks, and so on. Another reason you might want this is that, in practice, in deployed applications of machine learning, you tend to get data after you've trained the model. So you want to update the model in deployment, and for that, you need to make sure that you don't forget your training data. If you just keep running SGD on the new data, then your model will just forget. It's like Alzheimer's deep learning: it just keeps learning new data and the old stuff falls out of the model. If you'd like to tell the model that there are some parts of the weight space that it had better not change, because you've already trained on them, this sigma is the matrix where that is stored. Okay, but now I want to spend the rest of the lecture deviating from the deep learning view and first seeing how far we can get with this linear structure. Why? Well, one reason that's interesting is that we just saw we can use linear algebra — that's very efficient — and we can train these shallow neural networks in closed form. That is maybe a really useful property to have, because it gets us away from Adam and SGD and Shampoo and floof and all these new optimizers, and it gives us this beautiful uncertainty structure; maybe we can see how far we can get with that. It means we are going to add more features to this layer to make it broader and broader, rather than adding layers below to make it deeper and deeper. Another reason to think about this, and why we might want to do it, is something that I brushed aside in the last lecture and the lecture before, when we did parametric regression. I said that what we would like to do is supervised machine learning: we are going to get pairs of X and Y, and we think there's a function between them that we would like to learn. And then I said: but, you know, we want to use Gaussians, and Gaussians have these mean vectors and covariance matrices, and those are arrays, and functions are sort of infinite-dimensional, right? It seems like we're not allowed to do that. So we were a bit lazy and just said, okay, let's somehow squeeze the problem into a bunch of parameters, because we feel awkward talking about functions. That's why we decided to use this parametric form of f with weights and features. But really, you're all computer scientists — at least, now that you're sitting in this class, you could think of yourselves as computer scientists. And so you probably like functions, right? Functions are kind of cool. It's the reason why the very, very first class we send you to is functional programming. Because functions are really beautiful objects. And you've learned over the course of lots of undergraduate classes that you can actually do a lot of things by just thinking about the structure of functions without actively evaluating them. There are all these ideas of laziness and currying and beautiful structure of operating on functions. Couldn't we do something like that here? Wouldn't it be nice to have a probability distribution over functions, which is like the generalization of this thing?
So it has a mean, which is actually a function rather than an array — an object that you can call: you hand it some input and it returns evaluations of that function. And then it has this sigma thing, which we will call K, which is a function that you can call on a pair of inputs. These are like the elements of the matrix, right? A matrix has entries i, j; you can call this function on entries i and j, and it'll return the value of that matrix at that point. Or, actually, it's like a broadcasting function, where I can call it on one vector and another vector. Let's say one vector has size M and the other has size N, and this will return a matrix of size N by M. Actually, yeah, that's not a good docstring — it should say N by M, so the result is in R to the N times M. That would be nice, no? Just purely for beauty's sake. So how would that actually work? This is just an abstract thing, right? Just writing it down like this doesn't do anything yet. But let's think about what we actually need from functions. Just like in functional programming, it's nice to be able to operate on functions. And you probably know all these jokes about how Haskell programmers don't interact with the real world, and how IO is always nasty in functional programming. The end of functional programming is always that you eventually need to evaluate the function. After all the beauty of currying and partially evaluating stuff lazily, eventually you have to put a number in at the very end, and then the whole thing wraps together and you get an output. What does that mean for our Gaussian? It means that eventually we will say: what is the deal with this function? I want to know what the function actually is at this point, at this point, at this point, and at this point. And so I will actually call these objects, and what they return is something like this: a Gaussian distribution. So what I actually want from this object is that it has this constructor, which you can then evaluate on a bunch of points X, and what it internally does is it just calls all the functions. It ends all this functional programming business and just calls everything. It calls the mean, and that mean returns a vector; then it calls this K function, which returns a matrix; and then that thing is a Gaussian distribution — boom. An object like this — and that's the point I want to make before the break — is called a Gaussian process. Hands up: who's heard the term Gaussian process before? That's two thirds of you. So some have not and some have. For those of you who have heard about it before: is this a useful new way for you to think about it, or is that the way you've always thought about it? Ah, I don't really know. Okay. So this is what we would like to have: the ability to just call all these functions, and then, no matter what the inputs are, they always return a Gaussian. So it seems like we're sort of done here, right? Because this abstraction is actually all we need. It's just this call function. That was the whole business. To be able to reason about functions, you just say: well, there's this function, but I don't know yet where I'm going to evaluate it. Let's wait with the evaluation as long as we can. Let's operate on all these objects. And then eventually I tell you that I want to know the function here, here, here, and also here. Those are called x1, x2, x3, x4.
And then this thing will just make sure that it evaluates the mean and the covariance — which is a four-by-four matrix — that we can sample from, and we can plot the square roots of the diagonal elements, and then we get this kind of error bar. I'll show you code examples in a moment, and then we'll see how we get there. The only thing we should make sure of is that this actually is a Gaussian distribution. So we need some kind of assert statement in this call function, or actually in the constructor — they won't actually be in there, because it will turn out that this assertion is very hard to validate numerically — but you could think of an assert statement that just makes sure that when you do this call, you actually get out a valid Gaussian. So what properties do these m and k have to have for this to be a valid Gaussian distribution? Yes — so, for everyone to hear: m can be anything. m is just a vector. It can even contain infs and NaNs if you really want to; we would just have to be careful dealing with it, but it could be pretty much anything. We don't really need an assert there. But this thing, this k, has to be symmetric positive definite. No matter where we evaluate, this always has to be symmetric positive definite. Why? Well, Gauss himself already told us that this is the one thing we need: "Ferner sieht man leicht, dass k notwendig negativ sein müsse, damit Omega in der Tat ein größter Wert sein könne" — furthermore, one easily sees that k must necessarily be negative for Omega to in fact be able to attain a greatest value. So we notice that the expression in the exponential has to be negative-something, because otherwise this function will not have a mode, right? It won't look like a bell; it'll look like something that goes up towards infinity. You can't integrate over it and get a finite number out, so it's not a probability distribution — it doesn't normalize to one. So we need our k, for Gauss, to be negative — or actually, for us, because in 2023 we write the minus outside of the expression, k has to be positive definite. And objects that have this property — that no matter where you call them, they always return a positive definite matrix — are called kernels. And this is where I have to ask whether Matthias Hein has already introduced kernels in the statistical machine learning class. He has not? Nice, so I get to be first. So a kernel is a function that has exactly this property. Here's just the mathematician's way of writing down what we just observed. Definition: a function that takes pairs of inputs and returns real numbers is called a positive definite kernel, or sometimes a Mercer kernel, if for any finite collection X of input points, the matrix that arises from calling our call function in this object I just constructed has the property that it is positive semi-definite. So no matter which X's I put in — from some domain, of course, I have to assume there are some properties — that matrix will always be positive semi-definite. Semi, because we can actually deal with semi-definite, right? You saw, we can sample, for example, using the SVD, so it's fine. Just a quick reminder of what semi-definite means — this is it; if you don't remember, you can look at the definition here, but we've talked about it already. That's the only thing we need, and then it's fine. And the object that comes out of it, the mathematicians call a Gaussian process.
So let m be any function that maps from the inputs to the reals, and k be such a thing — this cool thing that always returns positive definite matrices, a Mercer kernel. Then a Gaussian process — here it is again — is a probability distribution over the function f such that every finite restriction to function values is a valid Gaussian distribution. So here you see at the top what a mathematician would write, and at the bottom what a computer scientist would write, and you can pick whichever of the two you like more. The bottom definition is maybe much more explicit about what we're actually doing, but it misses this bit about k — we would need some kind of assert statement, but that assertion is hard for these objects. And with the mathematician's definition, I know from experience, having taught this class before, that it can often be very, very difficult to understand what this k actually is. The one thing people always get wrong about these definitions is: what is k? We will talk about it after the break, because now the obvious question is: how do we build such a k? Where do we get it from? So far I haven't told you how. So let's reconvene at 11 past 11 — 11:11. Okay, while everyone's getting back in: that was a really good question during the break. Where did you go? Ah, there you are. The question was: in this definition here — if you look at the code in particular — where is f? This is supposed to be a distribution over f. Where is f? And I think this is a really, really good question to think about. So I'll give you my answer, and then you'll probably need to take it home, or into the lunch break, and think about it. What we're doing here is constructing a computer program that can produce probability distributions over function values at arbitrary points of evaluation — on arbitrary meshes, on arbitrary grids. Once I decide on which grid I want to represent the function — let's say I've decided the grid to be x1, x2, x3, x4 — I can call this Gaussian process object. So I first instantiate the Gaussian process object; I do that by something like gp = GaussianProcess(...), where I hand it some mean function and some kernel, and that gives me a Gaussian process object. Then I can call the call function of that — just open bracket, close bracket on this X: x1, x2, x3, x4. So I construct this Gaussian process with this mean function and this kernel function, I evaluate it on this grid x1, x2, x3, x4, and the object that comes out is, by construction, a Gaussian. And Gaussians are probability distributions with densities; they have all these properties. If I can show you our code again: here's our Gaussian class. Gaussians have, for example, a covariance and a log determinant and a precision and a mean times precision, and they also have a thing called log_pdf. So I can evaluate the log PDF of that. The input to this is function values — in this case, an array of four function values, if I have a grid of four points — and the number that comes out is the probability density that this Gaussian process assigns to those four function values, jointly, over this grid. Now, I would really like to show you code for this, and we will in a few minutes, but we can't do it yet, because we don't actually know yet how to construct such a kernel. To do all of that, I need to put in something here, and so far we don't know how.
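Still, the interface itself can already be written down. Here is a minimal sketch — a stand-in for the class on Ilias, which has more functionality and different internals — assuming a mean function m and a kernel k will be supplied later and broadcast over arrays of inputs:

```python
import numpy as np

class Gaussian:
    """Stripped-down stand-in for the lecture's Gaussian class."""
    def __init__(self, mu, Sigma):
        self.mu, self.Sigma = np.asarray(mu), np.asarray(Sigma)

    def log_pdf(self, f):
        # log N(f; mu, Sigma) = -1/2 [ (f-mu)^T Sigma^{-1} (f-mu) + log|Sigma| + d log 2 pi ]
        r = f - self.mu
        _, logdet = np.linalg.slogdet(self.Sigma)
        return -0.5 * (r @ np.linalg.solve(self.Sigma, r)
                       + logdet + len(self.mu) * np.log(2 * np.pi))

class GaussianProcess:
    """A distribution over functions: a mean *function* m and a kernel k.
    Calling it on a finite grid X collapses the laziness into an ordinary Gaussian."""
    def __init__(self, m, k):
        self.m, self.k = m, k

    def __call__(self, X):
        return Gaussian(self.m(X), self.k(X, X))

# Hypothetical usage, once we know how to build m and k:
#   gp = GaussianProcess(m, k)
#   gX = gp(np.array([[x1], [x2], [x3], [x4]]))   # a 4-dimensional Gaussian
#   gX.log_pdf(np.array([f1, f2, f3, f4]))        # joint density of four function values
```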
So I really want to get to this today; therefore, unless it's a very short question — no. The question was: can I think of the posterior as a special case of the Gaussian process? That's just the wrong way round. Gaussian processes also have posteriors, and those are also Gaussian processes. So the posterior of a Gaussian is not somehow more process-y than the other thing. The "process" part just means it's a computer program; it's a function that we hand functions to, and then it does some internal cool thing. So how do we get to these functions? Well, we're going to do something that might seem pedestrian at first and then turn out to actually be mathematically fine: we're going to go back to last Thursday, think about what we actually did in our parametric case, and then discover that we are already doing Gaussian process regression — we just haven't written the code the right way. If you think back to last week, we discovered that we can design priors over functions by talking about priors over weight spaces: we assign a Gaussian to the weights, and the function, which is a linear projection of those finitely many weights, is then itself a new Gaussian random variable. And the resulting object was this: a Gaussian distribution over function values — here I'm now using this bullet symbol for the open evaluation point — which has a mean given by the mean of the weights multiplied from the left with this feature function that we can evaluate (in the plots we've always evaluated it, but we could also just leave it as a function), and a covariance given by this, over the data at least. So we already have a mean function and a kernel function — or, in Gaussian process language, this is also sometimes called a covariance function. In particular, we can also construct a posterior using all of these objects. And what you can do is pretty much just stare at this expression and notice that there are a bunch of places in here where we are constructing these covariance functions and mean functions, and all the terms that we care about in the prior and the posterior only contain these expressions. In the prior, there's only this expression and this expression — and by "only" I mean that mu only shows up in expressions like this, and sigma only shows up in an inner product like this. So we could actually implement this distribution over f without accessing mu or sigma directly, by encapsulating mu in a function that does that, and sigma in a function that does that. And we can also construct the posterior using only functions that look like this. But before we get to the posterior, let me first show you how we might do this for the prior. By the way, here's our code again, which I've now updated on Ilias; it's now more extensive. We have this Gaussian class, which is a probability density function. And now there's a new thing that I've already shown you, called GaussianProcess, which so far has a mean function and a covariance function, and it has this call function. We will get to conditioning in a bit. So here's an example of that. I've first defined, again, a concrete feature function — in this case, the Gaussian feature function. That's one we didn't talk about last Thursday, but we could as well have.
You remember, on Thursday we first had the linear feature function — or actually, we then discovered that we could write it as polynomial features: it takes as input x and returns x raised to a range from zero to whatever, F or F plus one. Constant, linear, quadratic, cubic, and so on. Then some people shouted out: let's use trigonometric functions, sines and cosines. Then someone said we could use ReLUs, with points where there's a kink. And what nobody actually shouted out, which would have been much more of a thing in, like, 2007, would have been: let me give you bumpy feature functions that look like little Gaussian bells. That's another option. So I could decide to write such a feature function, which takes in x, and also how many features we want, what the length scale is, and where the left and right ends of the feature range should be. And then it returns — actually, this is a bit complicated, let's stare at it for a moment — it takes the input x minus a bunch of locations from the left endpoint to the right endpoint (here's the plot for that), divides by l, squares it, divides by two, and takes the exponential of minus that. So it looks like a bell shape. The reason I'm explaining it at such length is that the fact that this is a Gaussian function, that it looks like a bell, is not important at all. It's just a feature function. I could just as well have used kink functions — ReLUs — or sines and cosines, or step functions. But I'm going to use those bells because, historically, people like them; they're pleasing to the eye. Okay, so now I have 20 of those. I have this thing called a Gaussian feature function, which I can plot — here's a plot for it. And now I can actually do this kernel construction. Can you read it in the back, by the way? Should I zoom in? It's good, okay. So now I can construct something called the kernel function, which takes in pairs x1 and x2; x1 could be an array and x2 could be an array. And what we're going to do is implement this operation — down here it is. We take the left input and the right input, we build the features for them, and then we take the inner product of them, scaled by sigma. Sigma is the prior covariance of the weights, yes. So the question is: why does the kernel take two arguments and not a set of points? That's the beauty of broadcasting in Python. I just define a function on two inputs, which you could think of as of type X — in particular, in this plot each is a real number, right? But it could also be five real numbers. And around that there is all the stuff that we've already written, our Gaussian, and that takes care of broadcasting if I give it an array. So here we don't have to think about it anymore. Well, actually, we do have to think about it a little bit, because our convention is that the last axis is what's usually called the feature axis. We're thinking of our input space — the thing that I'm plotting x on — as consisting of objects that are row vectors. Every input will be a row vector. For this plot it's a row vector of length one, so that's a bit boring, but it still is a row vector. That means its entries are indexed along the last axis of the array. So if I have, in this case, 300 of these, which go from minus eight to plus eight, then what I'm constructing here — and I'm doing it explicitly so that it doesn't happen accidentally —
is an array of size 300 by one. So the shape of this is 300 comma 1, not just (300,), but 300 comma 1. That's important. And now this thing will do exactly the right thing. It will take the first input and construct the Gaussian features of it, and then it'll take the second input and construct the Gaussian features of it. So if I put in one linspace of length 300 and another linspace of length 10, then phi one is of size 300 by 20 and phi two is of size 10 by 20, because that's what the features do: they return 20 features. And then it says: ah, you want the product — times is a broadcast operator — so we're going to construct a thing of size 300 by 10 by 20, and then we sum out the last axis, the feature axis, and it's gone. And now we have a matrix of size 300 by 10. And that's the covariance between those 300 points and those 10 points — actually, it's the covariance between the function values at those 300 and those 10 points. Yes — good question: where is sigma in all of this? I have decided that I'm going to set sigma to the unit matrix, because why not, right? If I really wanted to have some other sigma, what would I do? There are two options. One is I construct the Cholesky decomposition of sigma and then multiply it in from the right here, outside of this, and then I sum again — it's fine. Or I do it before: I just define the features as if they were already multiplied by the Cholesky factor; I could just put it into another definition of the features. And I do have to be careful with the broadcasting — you always get broadcasting wrong: if I call this feature function on something of size 300 by one, what it returns is of size 300 by 20, and the other one likewise; after we've plugged in the data — something of size 300 by one and something of size 10 by one — we broadcast, get something of size 300 by 10 by 20, and sum out the last axis of 20 features, because that's what the inner product does — that's what this operation does. And in fact, we'll just do that. But first of all, I want to show you the plot. Now that I have this — this is called a kernel — I can actually define a Gaussian process. Or, first I also need a mean — but yeah, the mean is boring; we already saw that we can shift by anything. So I'm just going to implement, let me do that first, a zero mean function: a function that takes in x and returns zero for however many x's there are. Really boring, doesn't shift anything. And then I can instantiate my Gaussian process and say: I would like to have a Gaussian process that has a zero mean function — a function that always returns zero — and then this kernel, which has, in this case, 20 features. And then I can plot. Actually, it turns out that I've given these Gaussian processes a plot function, because plotting is always really tedious; you can look at it afterwards, it's on Ilias, and it produces this plot. So what I've done here is: I have, in black, those 20 features, 20 little bells, and they create a Gaussian distribution on any mesh — in particular on this 300-dimensional mesh. And you can play with this code. I could decide to only use 10 features, and then I get those 10 features. And I could decide to only use five features, and I get a plot with five features.
And I could decide to use 50 features, and then I get a plot with 50 features. And what you see in the back, these functions here, these red ones — they are samples from the Gaussian object that gets created by constructing this prior, calling it on the plotting grid, and then just sampling from it, because Gaussians have a sample function. One quick question — ah, what does functools.partial do? I just need it because, in this plot, I want to be able to do what I just did: sweep through the number of features. This kernel has an optional argument called features. And since I want to change that, I say: give this Gaussian process the function that you get by first deciding how many features you want and then returning this kernel, which now has just two inputs. In computer science terms, partial is currying — partially evaluating the function. Okay, so now — yes: you can call this function, and in fact, inside of here, it is called with different things; we'll get to that in a moment. So really, what we need is just the ability to evaluate this function, and notice that it has two inputs, which might be different, yes. No — so the question is: are those curves Gaussian? What exactly does "Gaussian" mean here? Are these functions, every one of these curves — actually, I can make just one of them; I can tell this thing to only draw one, and then we get this. Every one of these functions — and this is bad, I should change the plotting range; what did we get, red everywhere — this is just one of these functions. And you can see that it doesn't look anything like a bell-shaped thing. It is a weighted sum of those black curves. And we will spend three lectures or so thinking about in which sense it is actually a weighted sum. But the first intuition is just: there are these black curves, and when you sum them with weights in front — w's, particular choices of w's — then you get this curve. And in particular, if you draw standard Gaussian random numbers — well, there are two different ways to do it. One is: how many features did I use, 50? You draw 50 standard Gaussian random numbers, call those the weights, multiply those 50 black curves with those 50 numbers, and sum them up — you get this red curve. The other thing you could do, and that's actually what happens inside, is: you take these black curves, you compute this inner product on the plotting grid — that's a 300 by 300 matrix — you compute a factorization of that 300 by 300 matrix, and now you draw 300 standard Gaussian random variables, multiply those 300 with that factor, and add the mean, which is zero — then you get this curve. So it's actually a 300-dimensional Gaussian distribution, but it only has 50 degrees of freedom; the covariance matrix is of low rank. And because we've decided to call the sampling function in our Gaussian class — the class up here has something called sample, where is it, there, here — with the SVD method of sampling, it just does it. It doesn't complain if you give it a Gaussian distribution that only has 50 degrees of freedom, but we're drawing 300 numbers.
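Condensed into one sketch, the construction just described could look roughly like this — the names gaussian_features, kernel, and zero_mean are mine, and the constant scaling of sigma that comes up on the next slide is left out, i.e. the weight prior covariance is the identity:

```python
import numpy as np
from functools import partial

def gaussian_features(x, num_features=20, ell=1.0, c_min=-8.0, c_max=8.0):
    """Bell-shaped features phi_l(x) = exp(-((x - c_l)/ell)^2 / 2).
    x has shape (N, 1); the result has shape (N, num_features)."""
    c = np.linspace(c_min, c_max, num_features)   # feature centres
    return np.exp(-(((x - c) / ell) ** 2) / 2.0)  # (N,1) - (F,) broadcasts to (N,F)

def kernel(x1, x2, num_features=20):
    """k(x1, x2) = phi(x1) @ Sigma @ phi(x2)^T with Sigma = I."""
    return gaussian_features(x1, num_features) @ gaussian_features(x2, num_features).T

def zero_mean(x):
    return np.zeros(len(x))

# Prior over function values on a 300-point plotting grid; rank of K is at most num_features.
grid = np.linspace(-8.0, 8.0, 300)[:, None]
k50 = partial(kernel, num_features=50)            # "currying" the number of features
K = k50(grid, grid)                               # 300 x 300 covariance matrix
# Sample via an SVD factor: safe even though K is rank-deficient (a Cholesky would fail).
U, s, _ = np.linalg.svd(K)
sample = zero_mean(grid) + (U * np.sqrt(s)) @ np.random.default_rng(2).standard_normal(300)
```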
If you had used the eigenvalue decomposition, that would actually also have been fine, because it's almost the same as the SVD; but if you had used the Cholesky version here, it would have complained and said: I'm not allowed to do what you just asked, because this matrix only has 50 degrees of freedom, but you want me to draw 300 numbers. Okay, so let's see how far we get with this. On the next slide, I've tried to write down what we just did in math. Now we're going away from code again — maybe math is better. We've realized that last Thursday we already kind of did this thing; we just didn't implement it this way. We already computed these inner products between feature functions, which we now call the kernel, weighted by sigma. And you can think of these, if you write out the linear algebra, as inner products, which are double sums, essentially. Sigma has entries i, j, and for every input to the feature functions, we get out a vector of length F. So this object here is actually a sum over F terms for this i, and then another F terms for this j. So it's quadratically expensive in F — evaluating this thing is O of F squared, clearly, because there are two sums. We can make our life easy by deciding that we're only going to use a scalar covariance matrix. So I've decided to use sigma equal to a constant times the identity — I'll talk about the constant in a moment. The identity matrix can also be written like this; you've maybe seen this before, it's called the Kronecker delta, delta i j. It's just a function that returns one if i and j are equal, and zero otherwise. That simplifies the notation for this sum here, and we can get rid of some stuff. And the constant in front — ah, there's a typo; let's look at this one, this is the right one, I should have just copied this over here — the constant in front is sigma squared, sigma just being a constant, times the right end minus the left end of the grid of feature centres, divided by the number of features. You can already guess why I do that: because eventually I will take F to infinity and c_max and c_min to infinity, and then things will get easy. So I'm just setting them so that the rest of the computation is convenient. If we do things this way, then this inner product, because of the identity inside, can be written like this, as a single sum over products of feature functions, with a constant in front. And that's actually what I've done here in this definition, right? We take the feature functions, we multiply them with each other, and we sum out the summation axis. That's exactly what this line does; it's this thing back here. Oh, yeah, that's just wrong — it shouldn't be a square; okay, I need to fix that. I made this slide 10 minutes before the lecture started, so there are bugs in there, annoying. So now here comes the one big insight. So far, this is all fine and dandy — it's basically functional programming, throwing around lots of cool lazy-evaluation tricks, and it's maybe not so clear why we would do that; it's just abstract nonsense. And here comes the real beauty of this construction. What we've just seen is that the only thing we need to be able to do is evaluate this covariance function k. We have abstracted away the covariance matrix in the Gaussian and replaced it with just these two functions, and everything afterwards can be done by operating on these two functions. And in that function k, hidden inside, is this sum.
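Written out — my reconstruction of the slide's formula, without the typo — the simplification reads:

$$
k(x_i, x_j) = \sum_{l=1}^{F}\sum_{m=1}^{F} \phi_l(x_i)\,\Sigma_{lm}\,\phi_m(x_j),
\qquad
\Sigma_{lm} = \underbrace{\sigma^2\,\frac{c_{\max}-c_{\min}}{F}}_{=:\,c}\,\delta_{lm}
\;\;\Longrightarrow\;\;
k(x_i, x_j) = c \sum_{l=1}^{F} \phi_l(x_i)\,\phi_l(x_j).
$$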
So far, the code that I've written and that you've all just asked about actually computes this sum over 50 terms. And now comes the hour of the mathematicians rather than the computer scientists in the room, who say: in your first-semester undergraduate calculus class, Mathe 1 or Mathe 2 maybe, you encountered these objects called series, which are sums with infinitely many terms that you can nevertheless evaluate. So there are objects in math that are infinite but still tractably finite, where an infinite series over infinitely many terms can be written as just a number. You studied these series, and they led to the ideas behind integrals in calculus. So: oh, interesting, there are sums that have finite computational complexity even though they contain infinitely many terms. Here's an example of such a series; it's probably the first one you encounter whenever you go to a math class about series. Actually, the very first one is the case where x is just one half, and then you find that the sum is two. Does everyone remember this? Okay, because this is going to be important. This should really tickle your math bone: oh, in my code there is this sum over this neural network, which has these 50 terms; maybe I can get away with an infinitely wide neural network if I choose the features just so, so that I can use one of these cool math tricks and replace a sum with a number. And it turns out we can. On this slide I've copied again what we had on the previous slide, and I've copied over all the typos as well, of course. Now we take a very specific choice: we choose our feature functions to be of this bell-shaped form. That's the feature function I implemented in the code, and this is the math, which is actually a much nicer way to look at it than that annoying, nasty code. So here it is again. This thing that looked so complicated in code is just this: a function that takes in x and, at the l-th entry, subtracts the location c_l, which sits on our linspace from minus eight to eight, squares the distance between them, divides by two lambda squared, where lambda is a parameter (of course, in Python I can't call it lambda, because that's a reserved keyword, so in the code it has a slightly different name), and then takes the exponential. That's a function that looks like a Gaussian bell, but the fact that it's a Gaussian bell is not important; it's just the exponential of a negative square. So what does that do to our sum? In our sum, we now have these expressions: ooh, a product of two Gaussian features. What is the cool thing about Gaussian functions? If you multiply two of them together, you get something that is also of this form, because a product of exponentials is the exponential of a sum, and the sum of two quadratic functions is a quadratic function. You can rearrange that quadratic by completing the square, quadratische Ergänzung in German. So this object is a constant times a sum over these products, and if we multiply the two factors together, we can look up the terms on our complicated Gaussian slide and find that the product of the two features is the exponential of minus the squared distance between the two input points, x_i and x_j, divided by four lambda squared, times a second factor.
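For the record, the completing-the-square step just described gives, for the bell-shaped features in the lecture's notation (c the feature location, lambda the length scale):

$$
\exp\!\Big(\!-\frac{(x_i-c)^2}{2\lambda^2}\Big)\exp\!\Big(\!-\frac{(x_j-c)^2}{2\lambda^2}\Big)
= \exp\!\Big(\!-\frac{(x_i-x_j)^2}{4\lambda^2}\Big)\,
  \exp\!\Big(\!-\frac{\big(c-\tfrac{x_i+x_j}{2}\big)^2}{\lambda^2}\Big),
$$

so one factor is free of the feature location c, while the other still contains it.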
The four lambda squared is lambda squared plus lambda squared, with the factor of two coming down from the denominator above; and the second factor is a term in which the c_l's appear. So there is one expression that doesn't contain any c_l and one that does. Because it doesn't contain c_l, I can take the first one outside of the sum; it's just an extra factor in the product. And then I'm left with a sum over F terms that look like this. And now comes the big reveal. Actually, let me look at the code first. Here's our plot. I already showed you what this looks like for 20 terms: you can still see the features, and you can see the structure in the output with 10 as well. As I add more and more of them, let's say I go to 90, 80, you can see that this structure gets very regular, the curve gets very smooth. The y-range does not grow, because I've decided to divide by the number of features, as in the definition. And you get this very smooth kind of construction. Let me draw five samples again so it looks a bit more interesting. So we get this kind of distribution, and you can guess that there is a limit somewhere that we're heading towards. What does that limit look like? It's on the next slide. Here I've copied over the expression from the previous slide; now it's the exact, correct one, no more bugs. And now we take the limit of F to infinity. That's essentially what we just did in the code, and you saw that 200 is already pretty much infinity. As the number F of features in this box from minus eight to eight goes to infinity, the value of c_l within a bucket of width delta c is pretty much constant, because delta c becomes tiny. So within each delta c bucket the value of c_l is constant, and we can think of this sum as a Riemann sum in which delta c converges to the measure dc. That gives us an integral from c_min to c_max over this function. Ooh, but this is a Gaussian integral, and Gauss, and actually Laplace before him, knew how to do those. So now let's make our life simple and also take c_max and c_min to plus and minus infinity, moving them out of the way. Why do we need to do that? Because if we don't, we get these little boundary curves here at the sides of the plot. So if I increase the range, say from minus eight to minus twenty, then I don't see them in the plot anymore; for the purposes of this plot, 20 is infinity. I need... no, there's something wrong. No, no, no, that's something else: the feature gets evaluated, ah, I didn't pass the bounds over, okay. This is going to be difficult to type without seeing it, because I need to type on the keyboard without looking; the beauty of live coding in a lecture, as always. Nice, okay. So now we're done. We don't see the bounds anymore, and we could increase them further, but we don't need to. Then we essentially turn this into an integral from minus infinity to plus infinity, and now it is a Gaussian integral. We know its value no matter what x_i and x_j are: they are just shifts, and shifts don't matter on an infinite range. The value is just the square root of pi times lambda, times whatever sits outside. So we're left with this.
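As a quick numerical sanity check of this limit, here is a small sketch of my own, not the lecture's code, comparing the finite-feature sum with the closed-form value sigma² · sqrt(pi) · lambda · exp(-(x_i - x_j)² / (4 lambda²)) as F grows and the feature bounds widen:

```python
# Numerical check: the finite-feature kernel approaches the Gaussian-integral limit.
import numpy as np

def k_finite(a, b, F, c_min, c_max, sigma=1.0, lam=1.0):
    c = np.linspace(c_min, c_max, F)                 # feature locations
    phi_a = np.exp(-(a - c)**2 / (2 * lam**2))
    phi_b = np.exp(-(b - c)**2 / (2 * lam**2))
    return sigma**2 * (c_max - c_min) / F * np.sum(phi_a * phi_b)

def k_limit(a, b, sigma=1.0, lam=1.0):
    # sigma^2 * sqrt(pi) * lambda * exp(-(a - b)^2 / (4 lambda^2))
    return sigma**2 * np.sqrt(np.pi) * lam * np.exp(-(a - b)**2 / (4 * lam**2))

a, b = 1.3, -0.7
for F, bound in [(10, 8), (50, 8), (200, 20), (1000, 50)]:
    print(F, bound, k_finite(a, b, F, -bound, bound), k_limit(a, b))
```

The finite sums converge to the same number for every pair of inputs, which is exactly the "replace a sum with a number" trick from the series discussion.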
So what we're doing here, and let's just briefly make this point, is constructing an infinite sum over infinitely many features by evaluating a single function, and that function does not contain any features; I can just call it for any pair of x_i and x_j. And in fact, I can implement it here below; a small stand-alone sketch of this follows after this passage. This is the kernel. It's often called the square-exponential kernel, because people want to make the point that it's not Gaussian: it looks like a Gaussian bump, but that has nothing to do with the fact that we're constructing a Gaussian process. It's a kernel built from square exponentials, sometimes also called the RBF kernel, for radial basis functions. But radial basis function is a bad name, because there are lots of things that are radial and form a basis, not just these. So let's call it the square-exponential kernel. It's not the squared-exponential kernel, because we're not squaring an exponential; it's the square-exponential kernel, because there's a square inside the exponential. What it does is: you take x and y, subtract them from each other, divide by some length scale l, square, sum over all the entries (actually, the sum should probably be outside the square; it doesn't matter here because there's only one axis), divide by two, take the exponential, and multiply by some output scale. And now I can call this Gaussian process object, which takes a mean function, still the zero mean function, and this covariance function, which performs the infinite sum over features, and there are no features left anywhere. That's why this is called a non-parametric model: there are no parameters. Well, actually there are infinitely many parameters, but they're all summed out, so you never see them. And I can call it, and it makes this plot, and there are no features. So this piece of code allows us to operate on probability distributions over function spaces without explicitly choosing a set of parameters and tracking that set of parameters. Now, the only thing I still have to tell you before we end: oh, and by the way, of course you can also condition on data now, right? How does that actually work? I have a slide on it if you want to look at it, but you can basically go back to this expression and see that we can condition, just as we could condition once we had features. Actually, I'll talk about this next Monday, because it's not complicated, but I can show you the code: I can of course also load data, do inference in this limit, and get out a posterior that learns a function directly. The only thing you might now be worried about is: is this enough? What we've done is take our feature construction from last Thursday and simply push the number of features to infinity. Sometimes we can do that, and then we get out this particular kernel; cool. But maybe there are other functions that cannot be reached in this feature way. Can I reach every possible function that might want to be a kernel in this way? It turns out: yes. There is a set of statements due to Mercer, which is why these things are called Mercer kernels, that says, first of all, any function that can be written like this, as a sum, or actually as an integral like this, so anything that can be constructed in the form we just used, is a positive definite function.
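Here is a minimal, self-contained sketch of that kernel, of drawing prior samples from it on a grid, and of numerically checking the positive definiteness just mentioned; the lecture wraps this in its own GaussianProcess class, so the plain-NumPy sampling route and all names here are my own.

```python
# A sketch (my own, not the lecture's code) of the square-exponential kernel,
# prior sampling on a grid, and a positive-definiteness spot check.
import numpy as np

def sqexp_kernel(x, y, lengthscale=1.0, output_scale=1.0):
    """k(x, y) = s * exp(-0.5 * ((x - y) / lengthscale)**2), for 1-D inputs."""
    r = (x[:, None] - y[None, :]) / lengthscale
    return output_scale * np.exp(-0.5 * r**2)

x = np.linspace(-8, 8, 300)        # plotting grid
K = sqexp_kernel(x, x)             # 300 x 300 covariance -- no features anywhere

# Draw five zero-mean prior samples via an SVD square root of K.
U, s, _ = np.linalg.svd(K)
samples = U @ (np.sqrt(s)[:, None] * np.random.randn(300, 5))

# Spot check of positive (semi-)definiteness: all eigenvalues >= 0, up to round-off.
print(np.linalg.eigvalsh(K).min())
```

The eigenvalue check is exactly the v-transpose-K-v argument from the next slide, evaluated numerically on one particular grid.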
And you can convince yourself by reading the short proof down here, which really just says: if your matrix is an outer product of finitely or infinitely many features, then if you multiply it from the left and right with any vector, you can pull the vector inside the sum and notice that you're computing a sum of squares of real numbers; and because squares of real numbers are non-negative, the result is non-negative, so the matrix is positive (semi-)definite. But there's actually a beautiful statement that goes the other way around as well, and that's the real theorem: any positive definite function, so any function that is a kernel, can be written with such a representation, and it even turns out that the number of terms in this sum is always countable for any such function. Why is this the case? For that you would either have to read the paper by Mercer or go to a functional analysis class, which is largely about exactly this kind of thing; this isn't a functional analysis class, so I'm not going to tell you, but we will come back to kernels in later lectures. I'll leave the conditioning part out; we just need to think a little about how to do it properly in code, and we can do that next Monday. Keep in mind that next Thursday is a public holiday, so there won't be a lecture and there won't be a tutorial. Here's my summary. Please do provide feedback about this class and about today's lecture. What we discovered today is that, if we restructure the code in the right way, what we did last Thursday was actually already inference on functions; we had just decided to write the space of functions we care about in terms of explicit features. So what we first did was fix this problem in our code by rewriting this object called the GaussianProcess, which tries to be as functionally implemented and as lazily evaluated as possible, by allowing the constructor to take functions. And then we discovered that this is actually a really cool thing, because it allows us to do something inside that function that keeps track of infinitely many features in finite, tractable time. That was possible through one very specific choice of features, and of course your question might now be: is that the only one? Is that the kernel? Is that it? It will turn out that no, the story is way, way more powerful, but we need to talk about that next Monday. For now it's enough to know that there is one such kernel, one infinite expansion of features, which allows us to do inference on functions. Thank you very much.