What I want to do today is an example that I've taken pretty much verbatim from Carl Rasmussen's book on Gaussian process regression. It's been in there since 2007. And I've updated it a little bit; I've rewritten the code for myself. And we're going to go through the code slowly. Here's the data set. You already know it. You already knew it before this term started. But you also played with it in the very first week. So here are CO2 levels in the atmosphere over the Mauna Loa volcano on Hawaii, where NOAA, the National Oceanic and Atmospheric Administration — the US counterpart of the Deutscher Wetterdienst — has been collecting this data for a very long time, digitally since the late 1950s. So there's a little hut somewhere on this volcano that has been measuring CO2. And this is a curve that all of you have seen because it's in the media a lot. It's kind of a placeholder for what's happening with humanity and our atmosphere. And the natural question when you see a data set like this is, well, what happens on the right? What is our future going to look like? And early on in this term, we gave you this data set because it's very neat and convenient and simple and one-dimensional. It has a reasonable size. You can play with it very quickly without spending too much time on training and waiting for the computation. And we said, come up with some model that predicts what happens next. And I know that that exercise was a little bit open-ended. But the main message that we wanted to get across is for you to see that building a model for a curve like this is actually not so straightforward, and that the story often perpetuated among the amateur public about machine learning — that it's a black box where you just click the button and train the deep neural network — doesn't quite work. It tends to produce outputs that are a bit boring. So some of you produced models that produce straight lines that go not quite through the data, but sort of something like this. And that isn't very convincing. And some of you produced models that then stayed flat afterwards or had some other problems. Maybe you managed to get a smooth curve through the data, but then it extrapolated weirdly. And today, we're going to think about what we can do to build a better model. To do this, I'm actually going to spend most of today in code rather than on slides. And we'll see how far we can get. So I've also uploaded this Jupyter notebook on ILIAS. If you want to compute along, you can try that. I've already run the very first few cells. What I'm doing at the top is just loading a lot of different libraries. Most importantly, I'm loading our own Gaussian library that we've constructed over the past few lectures. This is our tool set, the toolbox that we've built ourselves in this lecture so far. It's the file that contains the Gaussian class, which is the parametric version, a concrete distribution from which we can sample and which we can evaluate, which has means and covariances. And also the lazy version of it, a Gaussian process class that just contains functions and only instantiates Gaussian objects the moment you actually say where you want to evaluate them. I have then loaded the data. I've done this before the lecture because I wanted to make sure it works, because it comes directly from the web; there is no file. It contains the very last datum, which is from, I think, last week. And now we can make a plot.
So some of you also asked, how do we make these plots? So today at some point, I want to spend five minutes talking about the plots. But this bit here is not the interesting part: I'm literally just plotting the data points and then making sure that the plot looks nice. I'm setting all the grids and axes and so on, so we get exactly the plot that you just saw on the slide. By the way, if you don't know how to make such plots, I hope that at some point, as a side effect of this lecture, you do actually look at this code up there and see what it does and why I set things like this. Because making a good plot is a big part of telling a message. But it's not something that makes for an entertaining lecture. If you stand here and talk about where to put the minor and major tick marks and so on, that's a bit boring. But I hope that you have a look at some point at how I actually make these plots. So now, what's the very first thing we could try? If you put yourself back in your own shoes from, I don't know, six weeks ago, before getting into statistical machine learning and probabilistic machine learning, what's your knee-jerk first model? And you can deliberately choose a very simple one. That's a straight line. Someone is already gesturing. So we've now learned how to learn straight lines. We describe the function that we're trying to learn as a sum of two terms — literally a linear projection of two numbers, an intercept and a gain. And we've learned that we can write this with features, if you like. We can actually also do it as Gaussian process regression, but let's not talk about that yet; let's first talk about the simple setup. We learned how to do this. We built these feature functions, phi. That's a function that returns only x and 1. This code is the general version of it; it would also work for x squared and x cubed and so on. But today, we're only going to use it with two features, the constant and the linear function. That allows us to build features of the data. And then we can do, well, Gaussian regression — parametric Gaussian regression, also known as least-squares regression — which in our language involves creating a prior for these two numbers, w0 and w1. That's a bivariate Gaussian distribution with a zero mean and some kind of big but simple diagonal covariance matrix. And then we condition this prior on the observation that w, mapped through the array that arises from evaluating this feature function phi — a long array applied to w — is equal to y up to some noise. And then behind the scenes, a lot of linear algebra happens that we implemented ourselves. So we can let this Gaussian class take care of it. And then we get a posterior for the weights. But we want a posterior over the function. So we take the bivariate distribution over the weights and project it back out into function space by multiplying with the features phi of some plotting grid, x. So little x — where did I set it? Probably up here. It's a linspace that goes from 1930 to 2053; that's the x limit of this plot. And I can plot the result of this process. And it's a straight line. So there's your straight line. It actually has uncertainty, but there's so much data that this thing becomes very, very confident about which straight line it should be. So you don't see the samples anymore.
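To make the algebra concrete, here is a minimal sketch of this parametric Gaussian (least-squares) regression in plain NumPy. The names (phi, linear_posterior) and the prior scale are illustrative stand-ins, not the lecture's own library:

```python
import numpy as np

def phi(x):
    # Features [1, x] for inputs x of shape (N,): returns an (N, 2) array.
    return np.stack([np.ones_like(x), x], axis=1)

def linear_posterior(X, y, sigma=1.0, prior_var=100.0):
    # Prior w ~ N(0, prior_var * I); likelihood y = phi(X) @ w + noise, noise ~ N(0, sigma^2).
    Phi = phi(X)
    S0_inv = np.eye(2) / prior_var
    # Gaussian conditioning (the linear algebra the Gaussian class hides):
    Sigma = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma**2)   # posterior covariance of w
    mu = Sigma @ Phi.T @ y / sigma**2                        # posterior mean of w
    return mu, Sigma

# Project the weight posterior out to function values on a plotting grid x:
# mean = phi(x) @ mu,  covariance = phi(x) @ Sigma @ phi(x).T
```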
You don't see the uncertainty. OK. And clearly, this is boring, right? But it's something you can do. And at this point, it might be useful to reflect on the fact that much of data analysis actually is like this. There's a lot of fitting of linear trends. You've seen lots of social media posts of someone drawing a straight line through a cloud of data and then going, oh, it's deep, whatever. And of course, for the CO2 data, this is a bit boring. I just want to note on the side, because someone asked, can we not start with a linear kernel: you can think of what happens here — this phi-transpose-phi, the inner product of the features, with a diagonal covariance in the middle — as evaluating a function of a and b. Actually, it's defined here: a function of a and b equal to a times b plus 1, because that's what the inner product of a feature vector (a, 1) with another feature vector (b, 1) gives. And then you could talk about this as a Gaussian process. It's a Gaussian process with a very specific kernel; it has the property — actually, it's up here — that the kernel of a and b is equal to ab plus 1. And this is called the linear kernel, and it's quite important. But if you do this, then the code looks a bit different. We define the kernel. We define a mean function. Then we instantiate our Gaussian process with this mean function and this kernel. We also pass in some parameters — that's the 100, the scalar term in front of the diagonal covariance matrix. And then you run this. And it does the same thing again. And it takes a bit. It's the exact same output. So linear regression is also Gaussian process regression. Actually, the biggest problem with this linear regression is maybe that it patently does not answer the question on everyone's mind. The thing we want to know is what happens next. And I'm pretty sure none of us hopes that this just continues as a straight line forever. The entire point of the big public discussion at the moment is to get this blue curve to turn back down again, right? Or at least to get flat. But a linear trend will never be able to model something like this at all. It has to be a straight line. So this model doesn't even allow us to think about what we actually care about. We will need a regression model that can learn a nonlinear function. Otherwise, no matter how simple it is, it cannot predict something like going back down, or even becoming flat, or even just a flattening out of this curve. So, we spent the last few lectures doing Gaussian process regression. Let's try that. If you didn't have this class, you could of course download sklearn.gaussian_process and figure out how to do this — there is a fit function, because scikit-learn always has fit — and then do that. But we don't need to, because we've written our own code, so we understand better what's going on. We're going to define your textbook standard Gaussian process regressor. For that, you have to decide which kernel you're going to use. And we learned over the last lecture — the one that clearly had too much content — that there are a lot of kernels. There is a kernel called the square exponential kernel. And one called the Wiener kernel. And one called the cubic spline kernel, the integrated Wiener process, which arises from ReLU features. There is a whole family called the Matérn class of kernels, and so on, and so on. So we have to pick one.
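For completeness, the linear kernel the question refers to fits in a couple of lines; again a sketch, with the scalar prior variance as an assumed parameter:

```python
def linear_kernel(a, b, prior_var=100.0):
    # Inner product of the features (a, 1) and (b, 1), scaled by the prior variance:
    # running GP regression with this kernel reproduces the straight-line fit above.
    return prior_var * (a * b + 1.0)
```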
And a good way to pick one is to think about the kind of samples that these processes produce. And one we encountered on Monday is this rational quadratic kernel, this one. That's a kernel that can be interpreted as an infinite scale mixture of square exponential kernels. So it models functions that are smooth — they are infinitely often differentiable — but for which we don't quite know on which scale they change. The square exponential kernel is this very smooth thing that produces functions that just kind of wobble up and down on one scale. But this function clearly has several scales: it's a little bit locally wobbly, and then it has this long trend. So we could maybe use this type of model. And this is a kernel that looks like this. It's a function of two inputs and a bunch of parameters. The inputs are A and B. Here it is in code. So it's a function that takes in A and B — let me get the pointer — and then it has parameters. And these parameters describe what the sample space is going to look like. The first parameter, L: we learned on Monday to think of this as a transformation of the input data by a linear scaling. So we take the inputs A and B and divide them both by L. Actually, I got this wrong in the cell up here — it should be L squared; it's correct down here. Let me fix this. It's difficult to type when you're not facing the screen, and when you're not using the right keyboard layout. OK, so this is like taking the inputs A and B and dividing them both by L. That tells us on which scale the function is going to change. This is important because this data has units: the inputs are measured in years. So L says on which scale, in years, we expect the function to change. And I've set the default to 1. We're maybe going to think about this in a moment. Alpha is this dimensionless quantity, the power to which we raise this. It comes from the gamma prior of the scale mixture. If alpha is very small, between 0 and 1, then the function changes on a lot of different length scales. And if alpha goes to infinity, then the function is very smooth and only changes on one length scale. So the standard setting might be 1. And theta is the output scale. This scales the y-axis up and down. And it also has units; the units are plotted on the left of this plot: ppm, parts per million. That's the quantity in which we measure CO2 content in the atmosphere — the famous numbers that in your generation are now in the 400s, and used to be 280 in my grandparents' time. So, this kernel arises by taking a square exponential kernel — that's a Gaussian function — and integrating it against a gamma distribution, which has a parameter alpha. And that alpha says something about how broad the distribution is. If it's very large, then the distribution is very narrow. And if it's very small, then the distribution is very wide. So that gives us either lots of length scales or just one length scale. Yeah, at some point we'll have to think about how we fit those hyperparameters, and we'll do that after the break. Okay, so if I choose this particular kernel, I'm going to set the length scale to one, alpha to 0.2, and theta to three, because this plot ranges over something like 80 different values over a little bit less than 100 years.
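In code, the rational quadratic kernel described here might look as follows; a sketch under the lecture's N-by-D input convention, with theta in ppm, l in years, and alpha dimensionless (names and defaults assumed):

```python
import numpy as np

def rational_quadratic(a, b, theta=3.0, l=1.0, alpha=0.2):
    # a: (N, D), b: (M, D). Sum squared distances over the feature axis.
    sqdist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # Scale mixture of square exponentials with gamma-distributed scales:
    return theta**2 * (1.0 + sqdist / (2.0 * alpha * l**2)) ** (-alpha)
```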
And I'm also going to do another thing: I'm going to add a constant mean to this Gaussian process. Because this function up here clearly is not centered at zero — the CO2 content in the atmosphere is not zero. So we have to account for that. Otherwise, our model assumes that this function moves around zero, which is just wrong. So what I've done is I've computed the actual mean of the data, the empirical mean — here it is — and then just said the mean of the Gaussian process prior should be the mean of the data. That's another way of centering the data. In statistical analysis, people often talk about centering the data. I think that's not really the right thing to do, because data has units, and you want the plot to show what the data actually looks like. You don't want to take the data and remove structure by subtracting the mean and scaling by the standard deviation. Instead, you change the model so that it knows that the data has a mean and a variance. So if we do that, then we can again do Gaussian process regression. And Gaussian process regression for us is now really just defining the prior and then calling condition on the data, because we've implemented it in code in this library. If you then make this plot — the rest is just plotting, and maybe we'll have time to talk about the plots at some point, but I don't want to start with that right now — then here's the plot; this is what we get. In the back, in gray, you see the prior with the constant mean. So this is a model that assumes that the function is what's called stationary: it returns to its prior mean, and it wobbles up and down on the output scale, with a standard deviation of three — so mean plus or minus three, because I've set theta to three — and it wobbles on length scales of about a year, plus or minus. And the posterior then looks like this. You see that it fits the data. That's already better than the linear model, because it actually captures the structure of the data locally, but then it extrapolates very badly. The reason for this is that this stationary kernel assumes that on long time scales, the function always returns to the prior mean. It just assumes that as part of the model; it's nothing extracted from the data. So if you want to make a statement about what might happen next in the future, this model is useless, because it is constructed to predict a return to the mean. If we want to fix this, we need something else. And now the next thing you might try — and I'm still constructing a bit of a straw man here, so I'm not going to ask you what to do, because it might feel bad afterwards — is to say, well, maybe I just got the length scale for this thing wrong, right? Maybe on a long enough time scale, everyone's survival rate drops to zero, right? If you just make the length scale long enough, maybe the fact that the model returns back to the mean is not even so bad. Once humanity dies out, it will return to whatever. So maybe let's just use a model that has a much longer time scale. I'm going to use something that I've deliberately, a bit deceptively, called the long-term trend kernel, which is actually just a square exponential kernel in disguise, but I wanted the name to signal that it will have a long length scale. And that kernel — oops; at this point, maybe you can read it — is the square exponential kernel.
So it takes two inputs, A and B; it subtracts them from each other, squares the distance, puts a minus in front, divides by two times a length scale squared, applies an exponential, and scales by the output scale. By the way, it also sums away this last axis. Maybe I should have talked about this a little bit more: our design assumption for our code is that input data is always of shape number-of-data-points by number-of-feature-dimensions. So it's always of size N by D, where N is the number of training points, or the number of points at which we evaluate, and D is the number of dimensions the data has. This data is one-dimensional, so D is one. But there might be other data that is two- or three- or four- or five-dimensional, or 28- or 768-dimensional or whatever, and then we want that axis to be summed away. And that also means that we always have to assume that the inputs are actually arrays of size N by D; they are never plain vectors. That's why we instantiate the data up there as arrays with the right shape — I explicitly add a feature dimension of one, even though the inputs are really just scalars. So, where is my long-term trend kernel? Here. The main thing I've done to this Gaussian process prior is that I'm going to claim that it has a very long length scale: 20 years. You could wonder whether 20 is even long enough, but let's start with 20. And I'll set the output scale to something larger, because if you look at this plot, maybe this is a bit too narrow, right? The prior doesn't actually capture the range of the data. Yes? So this is not a rational quadratic kernel anymore; the kernel up here has this shape, but this kernel is different. This kernel — I'm going to write it down — has no parameter alpha. It just assumes one length scale. And we've set the output scale to 10 and the length scale to 20. If we do this, we get a model that looks like this. Now the prior is a bit broader. I could actually make it even broader, right? Maybe we should say 20 so that it actually captures the range of the data properly — wait a second. So now the standard deviations, the range of the prior, capture the range of the data quite well. It's sort of centered on the data. And the scale of this process is 20 years, maybe — so it kind of goes up over — actually, maybe it should be more like a hundred, something like this. Let's change it to a hundred. So the length scale should maybe be a hundred years. And now the model is going to extrapolate a little bit further to the right. Ah, okay, good. Now you don't really see the uncertainty anymore, because I've made the length scale so long. I think that's why I originally had it at 20, so that you can see that it actually is a Gaussian process, not just a red line. So at this point, we have a much, much better model than what we started with, right? It's not just a straight line anymore. And now we can start to think about what we actually want to achieve with a regression model, like this Gaussian process or any other regression model. We want it to capture what we know about the data. What you probably are currently doing in your head is looking at this and going, oh, but hang on, there's lots of things wrong with this. So, okay, maybe you can criticize it a little bit. That's a good job for you. What's wrong with this model? Well, there's this local structure in the data — this blue curve clearly goes up and down and up and down and up and down.
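The "long-term trend" kernel is then just a square exponential under the same N-by-D convention; a sketch with assumed names and default values:

```python
import numpy as np

def square_exponential(a, b, theta=10.0, l=100.0):
    # Squared distance, summed over the last (feature) axis of the (N, D) inputs.
    sqdist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return theta**2 * np.exp(-sqdist / (2.0 * l**2))
```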
And can anyone guess what this is? Yeah, it's an annual cycle. It's the seasons coming and going. And actually, I mean, this is obviously the answer, right? But it's not actually an answer. It just says, well, there's a yearly cycle. But why is there actually a yearly cycle in CO2? I'll be honest, I actually don't know either. We should ask an atmospheric scientist. I'm guessing it's something to do with global weather patterns. You could now wonder — Mauna Loa is actually quite close to the equator, but north of it — is it maybe because humanity produces more CO2 in the winter or something like this? I'm not sure. Is it because big global clouds of CO2 mix over the Pacific and kind of move up and down over the year? Honestly, I don't know. But it's clearly an annual cycle. So this is something our model doesn't capture: this periodicity is not captured. Is it the Gulf Stream? Okay, so maybe we have an expert in the room. What else? What else is wrong with this model? It underestimates the uncertainty for extrapolation. Yes, so one way to fix that is to reduce the length scale again. Maybe we could go to 80, right? Or maybe 50, actually. Now we have to wait a little bit. Then we get a little bit more uncertainty. Any other complaints about this model? So when you say it underestimates the uncertainty for extrapolation, what do you actually mean? You're saying you would like the uncertainty to increase faster once the data ends. We could do that by reducing the length scale to one again, right? So — oops, I've set it to 10 instead of one, stupid. Okay, let's really do the extreme thing and go back to one. So now we're very uncertain, right? But actually, there's something more subtle going on. This model only offers us a few knobs to turn — the length scale and the output scale, nothing else — and those two knobs affect several aspects of the model at once. They affect how quickly the uncertainty grows, but they also affect how quickly the model returns back to the mean. And we don't really want those two things so strongly mixed. Why? Well, because there is something about the data that we know that isn't yet reflected in the model. Because we know a lot about this data. It's very close to us, right? We live in this air pocket together. We know the dynamics of human behavior. We think there's probably something going to happen over the next 30-odd years or so that might have an effect. We think it's hopefully going to flatten out, maybe come back down. We haven't even looked at the past here. We also kind of know where this came from, which is probably fairly flat down here. So we want our model to reflect this. And what that means is we need a good prior. What we're discussing here, this problem, is that we don't yet have a good prior, and that to make this model work, we need to design a good one. And whether you call this prior a prior or a regularizer doesn't matter. Whether you have a Bayesian perspective on it or a frequentist one, you need to include this knowledge about the problem in the model. And no amount of statistical trickery is going to fix this problem. We will need to put this in ourselves. So one way to get towards such a prior is to look at this data and repeat the observation we just made: that there are sort of two things happening in this process.
There's this periodic thing, this local annual cycle that goes up and down, up and down, on a relatively small scale — a scale of, let's say, five ppm, maybe even less, one or two. And then there's this global rising trend. And before, we talked about where the two come from, so we have some kind of causal idea: one of them is an annual cycle, and the other one is our human behavior. And maybe we can argue that those two have little to do with each other. The mathematical concept for this is independence. And I pointed out in the second lecture that the notion of independence is actually kind of a dangerous thing, because independence is a mathematical statement — it says something about the shape of a distribution — and "not having something to do with each other" is a very human statement. It's a philosophical thing, right? So we just equate the two because it's convenient. It's a little bit dangerous to do that, but we'll do it anyway. Yeah, so the question is, could we also do this in frequency space? We could have a model that sort of knows which frequencies are in the data. That's actually an option, and there's a very subtle, deep connection between what we're currently doing with this kernel and learning frequencies, which I might have time to talk about on the Monday after the break — but maybe not, because it's a bit deep. Suffice it to say that we actually are kind of learning frequencies here. But even if we wanted to do this concretely — one way to do it would be to take the data, do a Fourier transform, and then in frequency space assign two frequencies, right? One equal to one per year, because the data is scaled in years, and another one equal to — hmm, we don't know. This data is clearly not periodic — well, hopefully not; maybe it comes back down on a very long time scale. But even if we did that, we would effectively be modeling two functions, just in frequency space: one that is periodic with period one, and one that is periodic with period, I don't know, 100 or 200 or 1,000 years. So this is a situation where we are actually learning not just one function; we're learning at least two — a periodic one and one that has a much longer time scale. So at this point I do want to return to my slides for a moment and talk about what we would do if we wanted to learn several functions. Say we have two functions, for simplicity's sake, a first one and a second one. One way to model them is to say, well, let's put a model on both. One Gaussian process — why a Gaussian process? Well, because it's the only thing we have so far. So this is a Gaussian process for the first function, with its own kernel and its own mean function. I'm going to set the mean function to zero without loss of generality, because we already have a mean function — it's the constant function; we can add it post hoc. It's not going to change much; it just shifts the plot up. And another Gaussian process for the second function, with its own kernel. So there are two kernels, K1 and K2. An advance warning at this point: the indices on K and F are at the bottom here; they are subscripts. On a later slide I'll move them to the top, because otherwise we can't start doing kernel algebra. But here I'm just using this very simple picture, so it's fine if they're at the bottom. So we're going to claim that these two functions are independent of each other.
So one way to do that is to assign a joint Gaussian process prior to them, with a covariance matrix that is block diagonal. What does this expression actually mean? Well, it means that they're literally two different Gaussian processes in our memory; they don't interact with each other; they're just there. And it's a very hand-wavy abuse of notation to say there are these two functions that are both infinite-dimensional — they're functions — and I am thinking of a matrix that is somehow block diagonal: there is this matrix-valued function with K1 on one block of the diagonal and K2 on the other. Now the challenge arises because the data we've measured doesn't measure one of these functions or the other; it measures them both. We see this curve that goes wobbly, wobbly, wobbly up — so it goes up, and it is wobbly. It's both functions at the same time; what we see is a mixture of the two. A simple model for a mixture is a sum. And the fact that we model them as a sum also implies that we assume they don't interact with each other, even in our measurement, other than just being added together. And for yourself, you can wonder whether that's actually correct or not. Does the annual cycle have something to do with the long-term trend? What would that actually mean? How would such an interaction arise? Maybe it's actually not so bad to say there are just two functions that lie on top of each other in our data. So we can model that by saying that what we actually observe is a function f that happens to be the sum of the two, f1 and f2. And we can write a sum as a linear map: the vector (1, 1) applied from the left to the vector (f1, f2). That's very nice, because we know that Gaussians have beautiful properties under linear maps. A set of Gaussian random variables, when you apply a linear map to it from the left, remains a set of Gaussian random variables — Gaussians are closed under linear maps. So if you have two Gaussian processes for f1 and f2, then the sum of these two functions is also distributed like a Gaussian process: a Gaussian process whose mean is given by applying the map from the left to the mean function — well, the mean function is zero, so it's still zero — and whose covariance is given by applying the map from the left and the right to the covariance function. And if you go through in your head what happens here, maybe you can convince yourself that you're computing a sum of the two kernels: K1 plus K2. We talked on Monday about the fact that if you add two kernel functions, they stay a kernel function, a positive definite function — the sum of two covariance functions remains a covariance function. So when we build a model that uses a sum of kernels, we're actually modeling a sum of functions. So let's try that. Whoops. Here, I'm going to build a model that uses what I call a sum kernel. It's still a kernel, so it has two inputs, A and B, and then a bunch of parameters. And what it's going to do is sum two different kernels. The first one is the thing I call the long-term trend kernel, which is this kernel here. And I call it the long-term trend kernel not because of its form, but because I'm going to choose L to be a large number.
And I want this reflected in the code; that's why I call it this, so that the name suggests: ah, this is going to be on the order of 100. And then there's another kernel, for which I use this rational quadratic kernel, the thing that moves on different length scales, and I'm going to assume that it has a much smaller length scale. So for the long-term L, I said the default value should maybe be 20, and the default value for this other kernel should maybe be one — one year, up and down. And then after that, everything is as before. I'm going to create a prior, which is a Gaussian process with a constant mean, and it evaluates the sum kernel with these particular values for the parameters. Actually, I say a hundred years length scale for the long-term trend — sorry, not a hundred years length scale; theta, that's the output scale — a hundred ppm output scale for the long-term trend, five ppm output scale for the local up and down, 50 years length scale for the long-term trend, one year for the short-term one, and 0.2 for alpha, saying there are lots and lots of mixed length scales in there. Then we condition on the data — here, the posterior is where we condition on the data — and then we make a plot. So I have a plotting function for the data, and then I plot the kernel — sorry, the GP. So now this looks like this. We have in blue the data; you can barely see it. In the background, our prior. The prior now produces functions that maybe look a little bit more like the data — not quite, but they get a bit closer. There are clearly two levels of dynamics happening: a local wobbling, and a long-term up and down. And this produces this prediction, which now maybe addresses your concern, right? Now the uncertainty kicks in quickly right after the data ends, because the local short-term component enters, but the uncertainty added on this scale only happens on the order of plus or minus five ppm. And then after that, we get a growth of uncertainty from the long-term trend that gets added on top. What you also see is that the posterior mean still becomes a very smooth line. It's not straight — it's a nonlinear function — but it doesn't actually capture this periodic structure. So what you would like this thing to do, ideally, is to capture this periodic behavior. We know that the annual cycle is not going to vanish next year, right? The earth will keep going around the sun no matter what we do. So the model should ideally reflect that. Maybe it's not our first concern, because, I don't know, if you're talking to politicians or your colleagues, the fact that your model doesn't get this little wobbling thing is maybe not the first thing they worry about. And as a society, through the media, we've sort of learned to live with models that look like this and to say, whatever, it doesn't get that. But still, it would be nice if we could get it in. So let's do that before the break, because it turns out that there are kernels that can capture a periodic function. And this is the sort of thing where I could spend a long time talking about how to construct them, but I'm not going to, because it's something that someone just has to tell you: there's a kernel that can do periodic functions, and then you just know, and the next time you need it, you just look up the slides for this course again. So here's the construction.
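The sum kernel can then be sketched as the literal sum of the two components with the parameter values just described; it reuses the illustrative helper kernels from the earlier sketches:

```python
def sum_kernel(a, b):
    # Modeling f = f_trend + f_local is the same as adding the covariances:
    # k = k_trend + k_local.
    return (square_exponential(a, b, theta=100.0, l=50.0)             # long-term trend
            + rational_quadratic(a, b, theta=5.0, l=1.0, alpha=0.2))  # local wobble
```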
It comes from a text by my beloved David MacKay, which he wrote a long time ago, when the machine learning community was still only getting to grips with what Gaussian processes even are. And he had this nice construction where he basically said: we can embed the circle in the Euclidean plane by a variable transform. On Monday, I said one thing you can do to kernels is transform the inputs, and they still stay a kernel if you apply the input transform to both inputs. So the input transform that he proposes is: if you have an input x, like our time, then you construct a two-dimensional object, (sin x, cos x). Both of these are clearly periodic functions. Let's call that new variable u(x). And then when we apply a kernel to this that has the form of something like (a minus b) squared, summed over the feature axis — like this kernel here, which in the multivariate version, if these carry indices, is a sum over them; that's what our sum over the last axis does — then we're effectively computing the sum of squared distances in this new input space. And if you take differences of cosines and sines and square them, then you're effectively measuring distance on a circle. And you can use the trigonometric identity — the one you learned in high school, or didn't learn and found in your formula collection — that says (cos a − cos b)² + (sin a − sin b)² happens to equal 4 sin²((a − b)/2). That just happens to be the case. So if we plug this into our square exponential kernel, then we are effectively modeling a function that is confined to the circle. As x moves forward, u(x) just keeps going around in circles. And so things that are close to each other on the circle — in particular, also points that are close up to whole rotations — have to have the same function value. So if we implement this kernel — here it is. It's a kernel of two inputs, here called X and Y, and it still has parameters. And it's equal to the output scale times the exponential of minus two times the sine squared of something, divided by a length scale squared — the sum over X minus Y, that's the bit in here. What I've changed here is that I would like this oscillating behavior to have a particular period, because in general, data doesn't have period one. If I had just put (a − b)/2 into the sine, then the period would be two pi, right? Every two pi radians, we go around the circle. But I would like it to have an arbitrary period, so I'm giving that a name. So I take the data and scale it by the period over two pi — I multiply by two pi over the period. And that means the two and the one-half cancel, there is a pi in front, and we divide by the length of the period. Now, in our data the period is actually one, but in general we can set it to something else. And now there is a minor complication. If we just did this, we would get functions that look like this. They are perfectly periodic. They're not sine functions — they're just perfectly periodic. There are three of them plotted here. But in our actual data, who knows whether the annual cycle is exactly the same every year.
The shape of the annual cycle might change. The period doesn't change — it's always one year — but what the cycle looks like each year might drift. So to address that, I take this periodic kernel and multiply it with this other kernel, because I said on Monday that the product of two kernels is also a kernel, and products act a little bit like an OR on differences: if two points are far from each other in the periodic distance, OR far from each other in long-term distance, then maybe the function values are not the same. So that's like a decay of the periodicity, and it gives functions that look like this here on the right. They are locally periodic. From here to here, this function goes down, up, down, up, down, up. But over a long time, if you look over here, it actually looks different. This is just to be able to model that the cycle might not always be the same. Okay, and that's what I wanted to get to, because now we can finally just run this, and then we can take a break. I've now built a model that takes in the long-term trend and this local periodic behavior. And if we use this, then we get this kind of prediction. So this prior models that our data has a local periodic structure and a long-term structure. It'll capture this long-term whatever-it-might-be and then predict into the future with nice periodic properties. It's still not a really good model — we are going to change it after the break — but at least now, maybe you can appreciate that we've constructed something that actually captures a lot of the qualitative nature of our data, much better than your ReLU feed-forward neural network might have. Last question. So the first question is, why did we take a product of two kernels here and not a sum? Well, this is actually a sum of one kernel and then the product of two other kernels. And the product and the sum have different logical interpretations. A sum in the covariance function means that the covariance between two function values is high if either one of the terms is large. This is good for sums of functions, which is what we model. And a product means the covariance between two function values is high only if both of the kernels are large. And we use this to model the fact that the periodicity decays over time. You couldn't do this with a sum — and you can think for yourself in the break what it would mean to have a periodic function plus some other thing. And the other question was, why did we not have this periodic extrapolation in the previous case? Because that kernel could not capture the fact that the function is periodic. It just measures distance on the x scale, and it says: if you move away on the x scale, then the function values become different. It can't express something like, oh, but three periods away, I have to have about the same function value again, because the function is periodic. But this one does, because it measures distance in periods rather than in x space. It's a transformation of the inputs. So let's take a quick break. I'll leave this picture up, and we continue at quarter past, at 9:15. At this point, we now have this model that looks like this. We've managed to capture a long-term trend and a periodicity.
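MacKay's periodic kernel and its locally periodic product with a slow square exponential decay might be sketched like this (period in years; all names and default values are illustrative):

```python
import numpy as np

def periodic(a, b, theta=1.0, l=1.0, period=1.0):
    # Distance summed over the feature axis; for 1-D inputs this is just |a - b|.
    dist = np.sum(np.abs(a[:, None, :] - b[None, :, :]), axis=-1)
    # Distance measured on the circle via 4 sin^2(pi (a - b) / period),
    # with the constants absorbed into the exponent:
    return theta**2 * np.exp(-2.0 * np.sin(np.pi * dist / period) ** 2 / l**2)

def locally_periodic(a, b, decay=50.0):
    # Product of kernels: function values are similar only if the points are
    # close both in phase AND in absolute time, so the cycle's shape may drift.
    return periodic(a, b) * square_exponential(a, b, theta=1.0, l=decay)
```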
And then actually, someone came up over the break and asked exactly the pertinent question, which is: oh, does this mean I can combine pretty much any sum and product of kernels to build into the model what I know about the data? Exactly — that's what we're doing. So if you stare at this model, you might convince yourself that it is still not really exactly right, and we need to change it a little bit further. I'm going to offer one answer, which happens to be the one from the textbook. But of course, you have to decide for yourself: if you think the model I'm going to use includes or leaves out things you don't agree with, you are free to change it yourself and do your own analysis. So what we're going to do is make this model a little more powerful, a little more expressive, by including a few more things. What you do now, as a good scientist, is take a step back and look at the data again — which you can barely see here below in blue — and ask: what's actually happening in this data? So there is this periodic trend; clearly, there is an up and down every year, absolutely — but it's maybe not so important. Then there is a very long-term trend that has something to do with industrialization and humanity pushing CO2 into the atmosphere. But there's probably more happening. There's probably also something that has to do with the local vagaries of human existence — pandemics and wars and global crises coming through that might make minor dents that don't really have a long-term effect, but they still show up. Then there are also things that have nothing to do with humans. There might be climate oscillations on the planet that aren't even affected by us. Maybe there are solar cycles in there somewhere, on something like a length scale of 11 years or so. Various other stuff. And then, of course, there are the very short-term things: measurement errors on this volcano, local weather in Hawaii — whatever has nothing to do with what we care about, the global effects. So we should capture all of these; they all contribute. So we build a model that looks like this. I'm going to use a long-term trend kernel for the long-term change in the atmospheric CO2 content, and a periodic kernel for the local periodic annual cycle. And here I'm using this product between a purely periodic kernel and a decay over time, because I'm assuming that this periodic behavior isn't perfectly identical over thousands of years — it has its own local variations. And then there's a mid-term trend, which might capture something like global economic cycles or solar cycles or things like this. For this I'm going to use a rational quadratic kernel again — because why not, actually, right? It just says there are different length scales in the data that all happen around a length scale of roughly a few years, plus or minus a bit. And then there is noise — something you might call noise. The word noise just means the stuff I don't want to model. It's a typical kind of thought in scientific modeling. Noise doesn't necessarily mean measurement noise; it's just the things that really don't matter for our analysis. For that, I'm going to use, first of all, again a square exponential kernel — this one — but with a very short length scale.
For that, I'm probably going to set the length scale to, I don't know, 0.1 years — just tiny monthly ups and downs. And then even a totally independent noise. This final term here is just this "equals" kernel: it's one if the inputs are identical and zero otherwise. And this is literally a diagonal matrix, right? It's just plus sigma times the identity — independent noise on each measurement. It's like whether there was a scientist standing next to the probe while it was taking its measurement and breathing into it or not. Minor effects. Yes? No, it's just that I wrote the code earlier and then changed some of it to A and B so it wouldn't be confused with input and output, and then I forgot some of the cells when I did this last night. So A and B and X and Y are all the same thing; the names don't matter. Just like when you write an integral, you can integrate over dt or dt-prime or whatever. So, this is the first part of building a model: building a prior means deciding which kernels we're going to use — which qualitative nature of the data we're going to model. And the other thing we need to do is set all the parameters of these kernels, because they affect the quantitative nature of the model. You have to scale them in the right way. You have to say: we measure CO2 content in ppm, we measure time in years, so the parameters have to match those units. But they also have to match in relative size. We know that the periodic component is small — it happens on an output scale of plus or minus, I don't know, three or five or one. We know that the long-term trend has a large output scale — it goes up by on the order of a hundred ppm. And the length scale of the long-term trend is something like maybe 100 years or 50 years, while the length scale of the local behavior is something like, I don't know, 10 years, five years, 15 years. So here is the entire model — this is my definition of a model; it's the nonparametric version of the kind of code you know from deep learning, right? When people set up a deep neural network, there is this cell somewhere in the code that says: this is the architecture of the network. The corresponding thing is this. I'm going to use a Gaussian process prior whose kernel takes inputs, A and B or X and Y, and a bunch of parameters. And those parameters have names: I unpack them inside the function and give them names. They are called theta if they are output scales. They're called L if they are length scales. And they're called alpha if they are shape parameters — the dimensionless ones, for the rational quadratic kernel. One of them is for the periodicity, one for the decay of the periodicity, one for the length scale of the long-term trend, and so on and so on. And then I build a kernel which is the sum of the long-term trend, plus the periodic trend, plus the mid-term trend, plus the noise. And now I have to set what all of those parameters are. And your question came already at the beginning: how do you set those parameters? Well, I could set them by hand. That's what I do here, and I always have to do this initially, because I have to start somewhere. Just like you initialize a neural network by setting an initial set of random weights, we're going to initialize our parameters by setting them to some values.
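Putting the four components together, the composite prior covariance might be sketched as below, reusing the illustrative helper kernels from the earlier sketches; the numbers are rough initializations in the spirit of the lecture, not fitted values:

```python
import numpy as np

def co2_kernel(a, b, sigma=0.5):
    long_term = square_exponential(a, b, theta=100.0, l=100.0)        # industrialization trend
    seasonal = (periodic(a, b, theta=5.0, l=1.0, period=1.0)          # annual cycle...
                * square_exponential(a, b, theta=1.0, l=50.0))        # ...with a slow decay
    mid_term = rational_quadratic(a, b, theta=5.0, l=5.0, alpha=1.0)  # multi-year wiggles
    short_term = square_exponential(a, b, theta=1.0, l=0.1)           # sub-annual "noise"
    # "Equals" kernel: one exactly where the inputs coincide, i.e. iid noise.
    iid = sigma**2 * np.all(a[:, None, :] == b[None, :, :], axis=-1)
    return long_term + seasonal + mid_term + short_term + iid
```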
And I've deliberately chosen values for these parameters that are sort of sane, but also very rough. I forced myself not to set them to very precise values that I could have guessed — you know, you could have looked at the plot and said, oh, the output scale of the long-term trend should be maybe 80 ppm, because that's what it did. I'm not doing that. Instead, I'm setting them all to basically multiples of five or ten. So the output scale of the long-term trend will be 100 ppm, the long-term length scale 100 years, the output scale of the periodic component five ppm, with a decay over 50 years. The periodic length scale is one. That's the one thing we're not going to fit, because we're not going to change it — we know that it's true, right? The period is just one year. And so on, and so on. I've deliberately kept them as rough as this so that it doesn't look like I've fiddled with the numbers. And now I just make an array out of them, define the Gaussian process prior as before, now with this kernel, condition on the data, and make a plot. So now we can run this, and it makes this prediction, which we could just take at face value if we really wanted to. I think this actually looks pretty decent. It's not bad — it's probably better than any of the stuff that you submitted in your first exercise sheet, right? It has the periodicity, it becomes uncertain over time, and it becomes uncertain on a scale that looks reasonably realistic. This actually might happen in the future. Maybe it's not quite uncertain enough, I'm not sure. It also extrapolates well on the left, and it goes through all the data — it doesn't have to interpolate between the data points; it actually nicely goes through them. But of course it's not perfect. So one way to check whether it's perfect or not — a simple debugging tool for such models, one that you can only use if you have a probabilistic model — is to draw functions from the prior, without conditioning on the data, plot them alongside the real data, and see if you can pick out the data. So I do this: here I'm just going to draw five sample functions from the prior — not from the posterior; they're not conditioned on the data — and alongside, I'm going to plot the data. And now we can just look at them. And the question to you is: can you make out the data? If you can, raise your hand; if you can't, shake your head. Yeah, so some of you can still make it out, right? For those who can't: it's here. And a nice thing about this is you can look at these samples and ask: what's actually wrong with my model? What do I need to change to make them look a little more like the data? And one way to do this is visual: looking at the data, maybe the output scale is a little bit too small, maybe the length scale of the long-term trend is a bit too long. Okay, so that's the pedestrian approach, and you can do it for data like this. Of course, you couldn't do this if this were a vision model, where the input data is 768-dimensional or 50,000-dimensional, or if it's text, a language model — you couldn't possibly eyeball outputs from the model like this, because they would be very difficult to interpret. Yeah. Yes — so, can we automate this process? Can we automate what we would do visually? And you can, and it's called machine learning, and we're going to do it now.
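The prior-sample sanity check is only a few lines; a sketch, where mean would be the empirical data mean and kernel any of the composites above (names assumed):

```python
import numpy as np

def prior_samples(kernel, X, mean, n_samples=5, jitter=1e-8):
    # Draw functions from the prior (no conditioning on the data).
    K = kernel(X, X) + jitter * np.eye(len(X))    # jitter keeps the Cholesky stable
    L = np.linalg.cholesky(K)
    return mean + L @ np.random.randn(len(X), n_samples)

# Plot these next to the real data: if you can still pick out the data at a
# glance, the prior does not yet look like the data.
```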
But I still wanted to show you this plot, because what we're now going to do is literally an automation of exactly the process you were just doing in your head. You looked at this and said: the samples don't quite look like the data; what is the best change I could make to the parameters to make them look more like the data? Let's see if we can formalize this. In classic Bayesian machine learning, this is called hierarchical inference, or marginal likelihood optimization, or optimization of hyperparameters. So there used to be — actually, in Bayesian circles, maybe there still is — this big distinction between different types of numbers in your model. There is something called data. Data are the bits that you observe; they go into Bayes' theorem at the top, in the likelihood, on the left of the conditioning bar. Then there are what you might call variables. Variables are the things that we care about; our variable is the function f. Then there are parameters, which are things you just have to set. For us, the parameters might be the kernels: we've decided to use particular kernels, and we're not going to be uncertain about whether we use this kernel or that one; we just set them. And then there are things that you might call hyperparameters, which are the numbers that go into the parameters to make them work. This distinction is admittedly a little hand-wavy and constructed, and it's not always clear-cut — sometimes there's a bit of confusion between them — but they show up in different places. The variables are the things that are on the left-hand side of the prior. That's the bit you are uncertain about, about which you do inference to construct the posterior. The parameters and hyperparameters only ever show up on the right-hand side of the conditioning bar, never on the left. They're the stuff that you just have to set. So how would we tune these parameters? Well, how do you learn anything? Bayes — the answer is always Bayes' theorem. Inference is always Bayes' theorem. If there's a quantity you don't know, well, Bayes' theorem. That can always be your first course of action, even in the exam, right? If I don't know what to do, what could it be? Bayes' theorem. So what does Bayes' theorem actually do? It computes the posterior, which is given by the prior times the likelihood, divided by the evidence — and the evidence looks stupid, because it's just a constant. Actually, hmm, if you stare at this expression down here for a bit, you notice that this evidence is actually a likelihood itself. Not for the uncertain variables that we'd like to infer, but for the parameters. It's the marginal likelihood for the parameters: it defines a probability distribution over the data once you integrate out the variables. So what we could do is continue to do Bayesian inference by multiplying a prior for the parameters with this thing, this evidence — the marginal probability for the data given just the parameters, not the variables — and then doing Bayesian inference again, dividing by some other kind of evidence term. That's what we would like to do, and that's what we're going to try to do.
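Written out, the hierarchy just described is the following (with f the variables, theta the parameters, y the data) — standard notation, added here for reference:

```latex
% Level 1: Bayes' theorem for the variables, at fixed parameters.
p(f \mid y, \theta) = \frac{p(y \mid f, \theta)\, p(f \mid \theta)}{p(y \mid \theta)},
\qquad
p(y \mid \theta) = \int p(y \mid f, \theta)\, p(f \mid \theta)\, \mathrm{d}f .

% Level 2: the evidence of level 1 is the (marginal) likelihood of the parameters.
p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} .
```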
The only problem is that, in general, this creates a much more computationally complicated problem, because remember, we constructed our Gaussian process prior very laboriously to be computationally efficient, to have all this linear algebra that helps us do all of this. And when we introduce parameters like this, that's not necessarily going to work. So what does this actually look like for Gaussians? Here's a slide with lots of complicated math again. Here we have our model. For us, the variable — the unknown thing — is f, the function; the parameters are, well, maybe the mean function and the kernel; and the hyperparameters are the numbers that go into the mean function and the kernel: the length scale, the output scale, the period, and so on. And if you now look at Bayes' theorem, and at what it looks like for Gaussians, then you may remember that I told you before: for Gaussians, the product of a Gaussian prior and a Gaussian likelihood is equal to — I might have said just another Gaussian, but actually it's another Gaussian times a normalization constant, which is the thing that in Bayes' theorem goes below. The Gaussian is the posterior over the unknown thing, the weights or the function, and this normalization constant is actually the evidence. We've sort of forgotten about it so far, but it's actually computed in our code; it's in the Gaussian class. It's sometimes also called the predictive distribution for the data under the prior. And it has the form of a Gaussian probability for the data, evaluated at the prior mean projected through the observation operator, with the prior covariance projected through the observation operator, plus the noise. It's what you would predict the data to be under the prior. It's what the samples looked like that I showed you on this plot just now. So this distribution measures how probable the data are under the model — how much the data look like samples from the model. That's what we're trying to compute here. And what we would like to do is Bayesian inference with a prior for theta and this thing as the likelihood. Now, unfortunately, our parameters theta enter this likelihood in a very nonlinear fashion. I said you can do closed-form Bayesian inference with Gaussians if all variables are Gaussian distributed under the prior and the relationship between the variables and the data is linear. We could put a Gaussian prior on theta, of course, but this is not a Gaussian likelihood in theta. It's a Gaussian likelihood in the weights, but not in theta. Remember where theta actually enters for us: it enters in this model. It's the stuff that goes into these kernels in a very complicated way. It's deep inside of the model — literally deep. Because what we're doing here is effectively a variant of deep learning; you don't have to think of it that way, but it kind of is. In a deep neural network, you're learning the weights of the layers, these numbers in the lower layers. The output layer, if this function here is linear — so if the loss is quadratic; I've mentioned this before, and I'm going to mention it again in later lectures —
That final linear layer is the thing we can learn in closed form with linear algebra. But the stuff below it generally enters the loss — or, for us, the evidence — in a nonlinear way, so optimizing it takes a little bit of time, a little bit of computation, a little bit of effort to get right. Now, what we're doing here is not modeling the function in this parametric way; we don't have individual weights. Instead, we have a kernel, which says there's a function, and there are some parameters going in. But other than that, it's pretty much the same situation. So what we can do is not full Bayesian inference, but we can find the mode of the posterior. Just like with exponential families, where we said: if we don't know what to do anymore, we'll just find the mode of the posterior — and maybe we could even do a Laplace approximation with a second-order derivative, but we're not going to do that today; we'll leave it for later. We're just going to find the mode of this object. The log of this evidence — the marginal likelihood for the parameters — is also on this slide, and it looks like this; you might want to look at it later. You can stare at these expressions and try to get an intuition for what we're actually doing when we optimize these parameters. I'm not going to do that now, because I want to save some time, but I'll talk about it in another lecture, maybe after the term break. Actually, the reason I don't need to talk about it is this: about eight years ago, when I taught a lecture like this for the very first time, we would have spent an entire session on this object. I would have talked about how you compute it, and since optimizing it requires its gradient, we would have gone through a lengthy derivation like this one — some matrix calculus on matrices and vectors — to figure out what the gradients actually are, and come up with a complicated answer. But today, in 2023, we don't need to do any of this anymore, because we have autodiff. You should appreciate that in your generation, you often don't have to think about gradients at all. Maybe in some later lecture we'll have time to think about gradients for a bit, but these days we just load a library that can compute gradients, and maybe even Hessians, and hand everything to an optimizer. So what I'm doing here is defining a loss function: the negative log evidence. Why negative? Because I'm going to hand it to an optimizer, and optimizers minimize things by definition, while I want to maximize — so I put a minus sign in front and simply compute the log probability of the data under the prior. It's a one-liner: take the parameters, build a Gaussian process from them, evaluate the negative log probability of the data under this prior, period. Then I say: JAX, help me out, compute the gradient, please. And it does.
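The notebook uses our own Gaussian process class, which isn't reproduced here, so take this as a minimal self-contained sketch of the same idea: a single RBF kernel standing in for the lecture's composite kernel, with hypothetical parameter names.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # double precision for a stable Cholesky

def rbf_kernel(params, xa, xb):
    # single RBF kernel as a stand-in for the lecture's composite kernel
    out_scale, length_scale = params[0], params[1]
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return out_scale**2 * jnp.exp(-0.5 * sq_dist / length_scale**2)

def negative_log_evidence(log_params, x, y):
    # optimize in log space so all parameters stay positive
    params = jnp.exp(log_params)
    sigma = params[2]  # observation noise standard deviation
    K = rbf_kernel(params, x, x) + sigma**2 * jnp.eye(x.shape[0])
    # log N(y; 0, K) via a Cholesky factorization
    L = jnp.linalg.cholesky(K)
    alpha = jax.scipy.linalg.cho_solve((L, True), y)
    log_det = 2.0 * jnp.sum(jnp.log(jnp.diag(L)))
    log_prob = -0.5 * (y @ alpha) - 0.5 * log_det - 0.5 * y.shape[0] * jnp.log(2.0 * jnp.pi)
    return -log_prob  # minus sign: optimizers minimize by definition

# autodiff replaces the page of matrix calculus
grad_fn = jax.grad(negative_log_evidence)
```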
And I never think about all those slides with the complicated math anymore. I used to, but I don't now. Then I start from some initial guesses that I defined above and call a nice optimizer from SciPy: conjugate gradients. You could also call BFGS. If you really want, you can use plain gradient descent, but that would be less pleasant, because you'd have to babysit it a lot. So I just call conjugate gradients, and the only thing left to do is wait — we've all grown accustomed to waiting a lot in machine learning. Ha, done. Okay, so it ran 100 iterations; actually, it evaluated the function 200 times and the gradient one time fewer. Smart. And it found a function value of 215.56691. Is that a good value for a probability? We don't know, because it's the logarithm of a likelihood, and likelihoods are not normalized in parameter space. This is a probability for the data. Is a log probability of minus 215 a good probability? We don't know, because there might be other models that are better. Maybe there's a model that achieves 209; maybe there's one that achieves 150. I don't know — I've just optimized this one. That's the problem with maximum likelihood that I'm trying to get at: you don't really know whether the value you found is particularly useful or not. But it's an optimum. Actually, it's not even an optimum: the optimizer hasn't converged yet. It says "warning: maximum number of iterations has been exceeded", but I didn't want to wait any longer. You can run this code at home for, say, 300 iterations, and it might find something better. What you should probably also do is initialize it from a few other values and see whether it finds other local minima, because this is something like a ten-dimensional problem and not a particularly easy function to optimize, so different starts may find different values. What I've done here is a little piece of Python string wrangling to list all the outputs. Here is what we initialized the parameters with — those are the values I put in — and this is what the optimizer found. By the way, the period is not being optimized; I hard-coded it to 1.0. And these are some other values that Carl Rasmussen quotes in his book, found with pretty much the same method. Of course, he used data from 2007, sixteen years ago — but, unfortunately, humanity hasn't really changed course much in those sixteen years. You can plug his exact numbers into the model yourself if you like, and you'll find you get a plot quite similar to ours. I also invite you to play around with these numbers, or with the model, and try out other things; you may get slightly different predictions. Yes? The question is whether the parameters are correlated. They are certainly correlated — and not just correlated, they're also nonlinearly related to each other. The manifold on which good values lie is something complicated and nonlinear, and many different values might work. You could spend a lot more time finding better parameter choices, or several alternative ones; you could try to sample somehow from this distribution; you could build a Laplace approximation. If we had a bit more time, we could try all of this — but you want to see a plot, right?
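To make the wiring concrete, here is how the loss and gradient from the sketch above could be handed to SciPy's conjugate-gradient optimizer. The data arrays and initial values are placeholders, not the notebook's:

```python
import numpy as np
from scipy.optimize import minimize

# placeholder data; in the notebook, x and y come from the NOAA download
x_train = np.linspace(1958.0, 2023.0, 200)
y_train = np.sin(2.0 * np.pi * x_train) + 0.01 * (x_train - 1958.0)

# initial guesses in log space: [output scale, length scale, noise]
theta0 = np.log(np.array([1.0, 10.0, 0.1]))

result = minimize(
    negative_log_evidence,   # loss from the sketch above
    theta0,
    args=(x_train, y_train),
    jac=grad_fn,             # the JAX-computed gradient
    method="CG",             # conjugate gradients; "BFGS" works too
    options={"maxiter": 100},
)
print(result.fun, np.exp(result.x))
```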
So let's just run with those numbers and take them. And by the way, just in your head, notice the connection to a lot of the rest of machine learning: we've decided to run this thing once, until the time ran out — a few seconds, because you all want to go somewhere else — and now we're just going to look at the output. That's a little unsatisfying, but it is what it is. So — actually, let's look at this first. This is the prediction. You can see it looks quite a bit like the original one I set up by hand, which is maybe a good sign: it means the model is somewhat robust to how we choose the parameters. And here are some samples from the prior — actually, the data is in there as well. Can you find the data? By now you probably know what the data looks like, but maybe you can convince yourself that these are pretty decent samples. If I hadn't shown you the data before and just showed you this first thing in the morning, asking which curve is real and which ones are simulated — hmm, not so sure. Okay, so this is our prediction. We have maybe ten minutes left; let's see if we can do one final cool thing. So this is our prediction for the future of humanity, and also its past. And I hope you agree that from a modeling perspective — not in terms of the outlook for us and our climate — this is a pretty decent model; it does something quite interesting. You might ask yourself: is this actually what climate scientists do? I'll leave you to think about that on the way to your next lecture, because of course the answer is to some degree yes, to some degree no. I'm not claiming this is the only way to make a prediction about the climate — absolutely not — and the IPCC reports are far more complicated. One thing worth pointing out is that there are no other input variables in here: there are no additional axes for x. You could of course extend this model with other covariates you have measured. Maybe you know how much of the world's electricity is produced from renewables; maybe you go to the World Bank and download data on global economic output, on GDP, and see whether it has something to do with this; maybe you take the total number of people on the planet and try to relate to that. Whatever your political agenda and whatever point you want to push on Twitter, you can add dimensions here and use them for your argument. Maybe we'll do that as an exercise after the term break. The other thing you might want to do: we started from the observation that our data really shows several different functions combined — a periodic trend, a local trend, a long-term trend, and noise (see the kernel sketch below). They are all in there now, and our model predicts them all together. That's nice, it makes a nice plot, but it might also be useful to reason about what the individual components are. Can we recover those as well? Thankfully, that's something we can actually do in closed form, and it's called source separation. I have just one slide on that, to show you a bit of math, and then I'll show you the output.
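As a reminder, the model's covariance has this additive shape — schematic only; the exact parametrization of each term follows Rasmussen's book and the notebook:

```latex
% Additive structure of the CO2 covariance (schematic):
k(x, x') = k_{\text{long}}(x, x')       % long-term trend
         + k_{\text{seasonal}}(x, x')   % periodic component
         + k_{\text{mid}}(x, x')        % medium-term fluctuations
         + k_{\text{noise}}(x, x')      % noise terms
```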
A while ago I showed you this nasty slide of Gaussian algebra, where a big, intimidating equation shows up, and I keep waving my hand at it, saying this is the most general thing you could possibly do — and then we never really look at it properly. What it actually says is this: if you have Gaussian random variables distributed according to some joint distribution, and you observe data related to these latent quantities in a linear fashion through Gaussian noise, then not only is the posterior over x a Gaussian, but the posterior over any affine projection of x — put a big B and a little c in there — is also a Gaussian. You can still do Gaussian inference; the expressions become more complicated, but it's still just linear algebra. Before the break, we started with this multi-output setting, where we said there are really several different functions, f1 and f2 and maybe others. In our model there are now at least five: a long-term trend, a periodic component, a mid-term trend, and a few different types of noise. And what we observe is a vector of ones applied to all of them — their sum. That's our data. What we'd like to know afterwards are the individual terms. We can get those back out by applying a projection operator that takes the one-dimensional observation and maps it back into a multi-dimensional output; you can then look at the individual rows of this and get a prediction per component. If you apply this to our actual Gaussian process model, you can maybe convince yourself, just by looking at the equation up there, that to predict just the first function, we have to compute a Gaussian process posterior over that function, with posterior mean and covariance functions that look like this. And if you stare at them for a while, you may realize that this part is the piece we compute in our inference anyway: the covariance of the data under the model, inverted, applied to the data. The only new thing is another kernel multiplying from the left — the B operator, the projection out into another space. So we can take the Gaussian process inference we've just done and apply a linear map to it from the side; the same happens for the covariance, where it comes in from the left and the right (written out in the sketch below). In the Gaussian code I've now shown you a few times, there has actually always been a function I haven't talked about yet, called project, which takes the Gaussian process with its constructed covariance matrices and projects onto a different kernel on the left; you can look at the code to see how it does that. But I can also just show you. If I take our model, already trained — the one from above — and project it onto the long-term trend and the mid-term trend, then I can put the individual components for these two into one set of axes, so that it looks nice and convincing. It just takes a moment to compute. Here we go. In the back of this plot, you see the data in blue, in red the marginal distribution over the long-term trend, and in green the marginal distribution over the mid-term trend.
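Written out — my notation, assuming known mean functions and an additive kernel — the posterior over the first component f₁ alone is again a Gaussian process:

```latex
% Posterior over the first component f_1 alone, for data
% y = \sum_i f_i(X) + noise and additive kernel K = \sum_i k_i
\mathbb{E}[f_1(x) \mid y]
  = m_1(x) + k_1(x, X)\,\bigl[K(X, X) + \sigma^2 I\bigr]^{-1}\,\bigl(y - m(X)\bigr)

\operatorname{cov}[f_1(x), f_1(x') \mid y]
  = k_1(x, x') - k_1(x, X)\,\bigl[K(X, X) + \sigma^2 I\bigr]^{-1}\,k_1(X, x')
```

The bracketed inverse applied to the data is exactly what ordinary GP inference computes anyway; only the kernel multiplying from the left (and, for the covariance, from the right) changes.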
The red curve uses this y-axis, and the green one uses that y-axis, so it's scaled up. What you can see are some trends within the data: the data moves up a little here and then comes back down — you can see it in the blue data itself, and it's reflected, zoomed in, in this local trend. And both of these components come with uncertainties. Maybe think for yourself about how broad these uncertainties should be: when you compute marginals over individual components of the model, do you become more or less certain about them than about their sum, given that they're only ever observed mixed together in the data? You can draw a graphical model for yourself if you like; you don't have to. This decomposition lets us think about which parts of the signal are short-term effects and which are long-term. And the long-term trend here is, frankly, a sad state of affairs. We haven't played much with this model, and I'm not saying this is exactly how the future has to look, but it's not particularly promising. One correction, by the way, prompted by a question from the audience: the projection operator on the slide is not the identity matrix — you're right, it's a vector of ones, (1, 1), with no zero, because it maps the stacked components onto the scalar observation. I'll fix that in the slides later; the expression down here is correct, though. You can also do the same thing for the periodicity — Carl Rasmussen does this in his book, so I've done it as well. You can plot just the periodic component: here I take the posterior and project it onto only that component. Because it's a periodic function, it's a bit awkward to plot — the original axes aren't the right coordinates. So instead I plot it like this: this axis is time, from 1960 to 2040, and this axis is the month within each year. You can think of it as a helical view of the data, winding upward through time once per year. I plot only the posterior mean, shaded via an alpha channel by the uncertainty — the standard deviation — with no samples, because otherwise the plot would look crazy. And I've chosen a color scheme in which red means more CO2 — warmer, somehow — and blue means less CO2. Now we can look at this and find all sorts of interesting structure. For example, Rasmussen notes in his book that there seems to be a trend of the higher CO2 content moving toward earlier months. I'm not quite sure; maybe there's also a bend in there — and whether you see that bend or not actually depends on how you set the parameters, so play with this a bit if you like. But maybe what you see is that up here, the plot becomes white. That's not the color map; it's the alpha map. I'm shading by uncertainty, and by 2040 the model says: I no longer know what the periodic component will look like — that's just the length scale we've chosen. At this point, I'll leave you with a final slide and then let you go off into the vacation from Saturday onwards. Today we did a concrete example of Gaussian process regression. We will do more such examples in later lectures, though not in the very next one.
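For the curious, a plot of this kind could be built roughly as follows. Everything below is a stand-in — synthetic mean and standard deviation instead of the real projected posterior — just to show the alpha-shading trick:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.colors import Normalize

years = np.arange(1960, 2041)
months = np.arange(1, 13)
Y, M = np.meshgrid(years, months)  # shapes (12, 81)

# stand-in values: in the notebook, mean and std come from the GP
# posterior projected onto the periodic component
mean = np.sin(2.0 * np.pi * (M - 0.5) / 12.0) * (1.0 + 0.004 * (Y - 1960))
std = 0.1 + 1.5 * np.maximum(Y - 2023, 0) / (2040 - 2023)

# color encodes the mean (red = more CO2), alpha encodes certainty,
# so the plot fades to white where the model no longer knows
norm = Normalize(vmin=-np.abs(mean).max(), vmax=np.abs(mean).max())
rgba = cm.coolwarm(norm(mean))
rgba[..., 3] = np.clip(1.0 - std / std.max(), 0.0, 1.0)

fig, ax = plt.subplots()
ax.imshow(rgba, aspect="auto", origin="lower", extent=(1960, 2040, 0.5, 12.5))
ax.set_xlabel("year")
ax.set_ylabel("month")
plt.show()
```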
What we saw is that you cannot just apply any textbook model to any data. You can never do that, no matter whether it's Bayesian or frequentist or anything else. Yet every time you use a deep neural network without structure, that's effectively what you're doing. By looking at samples from the prior — a very probabilistic thing to do — you can check whether your model's hypothesis class fits the data well. That's fundamentally an advantage of probabilistic models. You can even use the mismatch between those samples and the data in a mechanical, computational way to optimize the model, so that it predicts the data well under the prior, and then condition on the data to construct predictions like the ones I've shown you. Doing that right involves a little Bayesian inference, and for us also still a little hyperparameter optimization — marginal likelihood optimization. We will talk more about what that actually means after the term break. My plan for after the break: we'll have one lecture with a lot of math, to counteract this lecture with a lot of pictures — about what exactly happens in these models, how they connect to the statistical machine learning class you may also have taken, and how to understand what this nonparametric business actually means. Then we'll have two lectures or so on how to do the computation efficiently and effectively. And then we'll start a big part of the course, several weeks long, where we really think about the connection to deep learning, and whether we can join the two, so that you can work with deep neural models with the same kind of expressiveness and flexibility that you just encountered with Gaussian processes. Thank you very much.