Thank you very much, Matteo, and thank you everyone for bearing with me. Today we're going to take a little step back on basis function regression, because I remember that last week I was, let's say, less than totally clear in my final explanations. Let me share the whiteboard and hopefully we'll make it clearer this time. But should you have any questions, just interrupt me and ask. Okay, so the setup is that we want to perform regression. That's the standard setup where we have observations y_i that are a function of some input x_i, plus additive Gaussian noise with zero mean and variance σ²:

y_i = f(x_i) + ε_i, with ε_i ~ N(0, σ²),

where typically we take y_i to be a scalar and x_i to be a vector. And we make an ansatz about f. In the case of linear regression, we assumed that f was an affine function of the inputs. In this case, instead, we're going to say it's a linear combination of a set of nonlinear features, which we also call basis functions. You can think of this as extracting nonlinear features out of your input, except that your nonlinear features are fixed functions. For convenience we will sometimes write it as a dot product between a weight vector and a feature vector,

f(x) = wᵀ Φ(x),

where capital Φ(x) = (φ₁(x), ..., φ_D(x)) is the vector of the D features we're extracting. Now, what we have seen is that, a priori, if we put a spherical zero-mean Gaussian prior over the weight vector, w ~ N(0, I), then the function f itself is a random function: it's a linear combination, with random coefficients, of a fixed set of functions, so its value at any one point is a random variable. A priori, under p(w), the expectation of f(x) is zero everywhere, and by a very simple calculation the expectation of f(x_i) f(x_j) under the prior over the weights is

E[f(x_i) f(x_j)] = Φ(x_i)ᵀ Φ(x_j).

So a priori, f is a random variable with zero mean, and its covariance — the two-point correlation function — is uniquely determined by the φ's. The choice of the set of basis functions tells you what type of correlations your model admits a priori. In particular, one thing I tried to make clear: suppose, as is natural for localized features, that the φ's tend to zero as x tends to infinity. Since the correlations have this form, the variance, which is what you get when you take the expectation of f(x) times f(x), is the squared norm of Φ(x). If your x happens to be very large — in particular, outside the region that you wish to model — then the variance of f(x) becomes essentially zero. So this formula says that if you choose basis functions that are localized, you are basically saying not only that your function is zero in expectation everywhere, but also that the level of confidence you have in this prior expectation, which is expressed by the variance, becomes extremely high as x becomes large. The little sketch below makes this concrete.
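Here is a minimal numerical sketch of this prior, assuming Gaussian bumps as the localized basis functions; the centers, widths, and grid are illustrative choices, not anything fixed by the lecture. It draws a few functions from the prior and shows the prior variance ‖Φ(x)‖² collapsing away from the region the bumps cover.

```python
import numpy as np

# Localized basis functions: Gaussian bumps centred on [-3, 3].
# Far outside this interval every phi_k(x) is essentially zero.
centers = np.linspace(-3.0, 3.0, 10)
width = 0.5

def Phi(x):
    """Feature vector Phi(x) = (phi_1(x), ..., phi_D(x)) for each x."""
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

rng = np.random.default_rng(0)
xs = np.linspace(-8.0, 8.0, 401)
F = Phi(xs)                            # shape (401, D)

# A few random functions from the prior: f(x) = w^T Phi(x), w ~ N(0, I).
W = rng.standard_normal((F.shape[1], 5))
prior_samples = F @ W                  # each column is one sampled function

# Prior moments: E[f(x)] = 0 and Var[f(x)] = ||Phi(x)||^2.
prior_var = np.sum(F ** 2, axis=1)
print(prior_var[200])   # x = 0, inside the bump region: order 1
print(prior_var[0])     # x = -8, far outside: essentially 0
```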
Then I made a kind of example: suppose you have data. Okay, what I was trying to say is that you would typically choose your basis functions to be localized where the data is; that's a typical choice people make, and maybe you extend them a little beyond the data, just to be on the safe side. But this kind of choice implies that, whatever combination of weights you choose, your function will always be zero — and certainly zero — when you go very far from where the data is. Now people objected: well, if you have data, then you're not going to take samples from the prior. And that is very true. If I have data, the distribution under which I compute my expectations of the random function is no longer the prior; it's the posterior. So the question now becomes: what is the posterior over the weights? The situation is that I have a set of i.i.d. observations (y_i, x_i) for i = 1, ..., N, and my regression equation is y_i = wᵀ Φ(x_i) + ε_i. All the calculations go through exactly as they did in linear regression, except that instead of x you have Φ(x). So how do we compute the posterior over w in this case? Well, p(w | data, σ²) is proportional to the likelihood, which is a product of Gaussian terms, times the prior, which is the unit-variance Gaussian prior over w:

p(w | {x_i, y_i}, σ²) ∝ [∏_{i=1}^{N} N(y_i | wᵀΦ(x_i), σ²)] · N(w | 0, I).

Now I've got a product of exponentials, which becomes the exponential of a sum. And this is still a quadratic form in w, because nothing has changed other than the fact that we're extracting nonlinear features out of our x's. As a result, this is still a Gaussian, and as a Gaussian I should be able to write it as a normal with a certain mean m (a vector) and a certain covariance C (a matrix). Now I have to work out what the mean and the covariance of the posterior are. How do I do that? I look at the quadratic terms to identify C, and at the linear terms to identify m. The quadratic term is wᵀ C⁻¹ w, and if you look at the likelihood, each term contributes a quadratic piece wᵀ Φ(x_i) Φ(x_i)ᵀ w, so collecting everything gives

C⁻¹ = I + (1/σ²) ∑_{i=1}^{N} Φ(x_i) Φ(x_i)ᵀ.

This is the inverse of the covariance, and I just invert both sides to get the actual posterior covariance over the weights.
On one side I will have wᵀ C⁻¹ m; that's the coefficient of w. On the other side, there is no linear term in the prior, but the linear terms come from the likelihood, and they are of the form wᵀ Φ(x_i) y_i / σ², summed over i. So matching coefficients,

C⁻¹ m = (1/σ²) ∑_{i=1}^{N} Φ(x_i) y_i,

and multiplying both sides by C, the posterior mean is

m = C · (1/σ²) ∑_{i=1}^{N} Φ(x_i) y_i,

which is C times a linear combination of the basis functions evaluated at the observations, with the 1/σ² in front. So this is the posterior over the weights. If I go back to the earlier setup, the ensemble of random functions that fit the data are the functions obtained as linear combinations of these basis functions with weights drawn from the posterior — drawn from this Gaussian with this mean and this covariance. What changes? Yes, I can see a question: is there any significance if the mean were equal to zero? So, C is an invertible matrix, so it won't send a nonzero vector to zero; in general this expression won't vanish. For the mean to be zero, either your φ's are zero at the points where you have input observations — which would be a very odd choice, because remember, the basis functions are something you choose, so that would be very unfortunate — or all your observations are zero: if all the y_i are zero, then the mean is zero and your function is still zero in expectation everywhere. But in practice, if your data looks like what we have in the figure, the weights will not be zero; the weights will be such that the expected function (let's draw it in orange) goes somewhere through the data and then back to zero when you're far away. That would be E[f(x) | y], the posterior mean function. So you don't expect your posterior mean function to be zero. But what do you expect it to be? Let's do this calculation now. We need to compute the expectation of f(x) under the posterior, and I'm going to write the conditioning a little more succinctly: I'll call everything we condition on "the data", D. Now f(x) = wᵀ Φ(x), the expectation is linear, and Φ(x) is not a random variable, because these are fixed functions. So

E[f(x) | D] = mᵀ Φ(x),

the posterior mean dotted with the vector of basis functions. In our case, where the set of basis functions was finite and they all went to zero as x became large, we're still getting something that vanishes at infinity: m is a constant vector — once you plug in the x_i and the y_i, it's just numbers — times a set of functions all of which go to zero at infinity. So the posterior mean function still goes to zero at infinity. And that's okay; we don't mind that. A small sketch of this computation follows.
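This is a minimal sketch of the posterior computation, again assuming illustrative Gaussian-bump features and a toy sine dataset of my own choosing; it implements the two formulas above and evaluates the posterior mean function mᵀΦ(x).

```python
import numpy as np

# Posterior over the weights, implementing the formulas above:
#   C^{-1} = I + (1/sigma^2) sum_i Phi(x_i) Phi(x_i)^T
#   m      = C (1/sigma^2) sum_i Phi(x_i) y_i
centers = np.linspace(-3.0, 3.0, 10)   # illustrative Gaussian-bump features
width = 0.5

def Phi(x):
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

rng = np.random.default_rng(1)
x_data = rng.uniform(-3.0, 3.0, 20)                        # inputs near the bumps
y_data = np.sin(x_data) + 0.1 * rng.standard_normal(20)    # toy observations
sigma2 = 0.01                                              # noise variance

F = Phi(x_data)                                 # N x D design matrix
C = np.linalg.inv(np.eye(len(centers)) + F.T @ F / sigma2)  # posterior covariance
m = C @ (F.T @ y_data) / sigma2                             # posterior mean

# Posterior mean function: E[f(x) | data] = m^T Phi(x).
xs = np.linspace(-8.0, 8.0, 401)
post_mean = Phi(xs) @ m
print(post_mean[200])   # near the data: tracks the underlying sine
print(post_mean[0])     # far from the data: back to 0
```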
But what about the posterior variance of the function? Now here, unfortunately, I do have to subtract the mean, because under the posterior f is no longer zero-mean, so the computation is a little more complicated. But you don't have to worry too much about it. As you can see, C is a constant matrix, so the second moment E[f(x)² | D] is still an expectation of a w wᵀ-type object sandwiched between Φ(x)ᵀ and Φ(x); it works out to Φ(x)ᵀ (C + m mᵀ) Φ(x), and subtracting the squared mean leaves

Var[f(x) | D] = Φ(x)ᵀ C Φ(x).

You can work out this calculation in detail as an exercise, but the point is that every term contains the φ's evaluated at x, so under the posterior, too, the variance tends to zero as x becomes large — and x doesn't even have to be that large. Basically, what this means is that if I take basis functions of this type, covering the data plus a little bit to the left and to the right, I will not be able to make any meaningful prediction far from the data, because a prediction of zero with infinite confidence is not a meaningful prediction. Any prediction with infinite confidence is not a meaningful prediction in a Bayesian setting. So the fundamental flaw of basis function regression remains whether you average over the weights with the prior or with the posterior. And that makes perfect sense: if, under the prior, you have a priori zero variance on the random variable f(x) for x far from the data, no data will be able to change a zero variance. In the Bayesian setup, evidence cannot move certainty. By the way, Matteo told me that you've requested some exercises, so I'll pass some exercises to the course secretary. But in general, the books I've recommended do have exercises, and they tend to be good ones. And you will have noticed that sometimes, when I'm too lazy to do a calculation in detail, I say I'll leave it as an exercise — well, those are also good exercises. Okay, so here's a good moment to take some questions. We've addressed the issue you identified last time: yes, the prior may give you zero mean and zero variance far from the data, and you're certainly not going to draw samples from the prior once you have seen data. But even if you take samples from the posterior, which is what you would do, all the functions you obtain — linear combinations of the basis functions with weights drawn from the posterior — will still converge to zero far from the data, or rather far from the support of the basis functions. Therefore you're going to have infinite-certainty predictions away from the data. That's the fundamental flaw of basis function regression. — Excuse me, what is this method used for, then? — It's useful for interpolation; it's a good method for that. The quick check below shows the variance collapse numerically.
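As a numerical check of that claim, here is a short sketch (same illustrative bump features and toy inputs as the previous one) that evaluates the posterior variance Φ(x)ᵀCΦ(x) at a point inside and a point far outside the basis support.

```python
import numpy as np

# Posterior variance check: Var[f(x) | data] = Phi(x)^T C Phi(x).
centers = np.linspace(-3.0, 3.0, 10)
width = 0.5

def Phi(x):
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

rng = np.random.default_rng(1)
x_data = rng.uniform(-3.0, 3.0, 20)
sigma2 = 0.01

F = Phi(x_data)
C = np.linalg.inv(np.eye(len(centers)) + F.T @ F / sigma2)

def post_var(x):
    v = Phi(x)
    return (v @ C @ v.T).item()

print(post_var(0.0))    # small but nonzero: the data constrains f here
print(post_var(8.0))    # ~0: a zero prediction with "infinite confidence"
```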
Let me give you an example from my own work. A few years ago I wrote a paper that became relatively well known, where we analyzed, using techniques not entirely dissimilar to these, the dynamics of the war in Afghanistan as inferred from WikiLeaks data. This is almost 10 years ago now, but you probably still know that WikiLeaks was the organisation that leaked a lot of geolocated data from Afghanistan. So if you draw Afghanistan, which looks a little bit like this, what the data gave you was a lot of dots across the country. These were conflict events, and what we did was model their dynamics across five years, from 2004 to 2009, and then we showed that you could use the learned dynamics to predict what would happen in 2010, which you could do reasonably well. And how did we do that? We postulated a dynamical system with some latent variables, which were combinations of basis functions, giving you the probability of an event happening. The thing that evolved in time was the coefficients of the basis functions. So we had a fixed set of basis functions which covered most of Afghanistan, but not all of it. You will have noticed that I've drawn the dots roughly on a circle; the reason is that there is something called the ring road in Afghanistan, and in the middle there are mountains which are essentially uninhabited. We had no data whatsoever there. So in that case it was reasonable to use a set of two-dimensional basis functions localized at the positions where we were interested in making predictions. We were interested in learning the dynamics and making time-ahead predictions; we were never interested in saying, okay, we've done Afghanistan, let's predict what's happening in Kazakhstan, because we would have predicted zero with infinite confidence. It's just not something we wanted to do. — And so you predicted future conflicts in this way? — Well, we predicted future events within the same conflict. If you're interested, I can give you the reference; the paper uses things a little more advanced than these, but it builds on them. — I have another question. Suppose you have a dynamic which is very high dimensional, so you have very high dimensional data, and the experiment is repeated, for example, every day for a long time. You do a little bit of PCA and you find out that the intrinsic dimensionality of the data is almost one, so it's like a point tracing a trajectory in this very high dimensional space. In this case the point is moving through space, so I can't really use this kind of model, because these require the data to be somehow localized. — Well, it depends, because, you know, take this example: the data was 2-D, but in effect the dynamical system we were defining had about 100 basis functions, so it was a dynamical system over the 100-dimensional space of the coefficients. And of course that dynamics sweeps out a trajectory in that space, so the object you actually learn has much lower dimension than the full 100 dimensions times the number of time points — but that was still okay. But yes, these methods are used primarily for regression, or geostatistics, or things like that.
— Yes, my question in the end was: what if I have something like what I described and I want to predict the next step? Could I use a model like this to predict, for example, the next difference? Because if I pass to the differences between steps, then I'm in something localized in space. — So, we're moving a little further from the topic of the lectures, but the point is that these models do not have a concept of time. My cartoon here is one dimensional, but the axis is not time; I'm not using it as time. To predict in time, you have to have some sort of dynamics: you have to say there is some law that tells me what x_t will be given previous values of the state. That could be a Markovian dynamics or a non-Markovian one; it could be something with coefficients that you learn from data by observing a time series, or whatever — but you still need to have it. Here, instead, f(x) is not something that has its own evolution; it's something we just see instances of, and there is no concept of time at all. Time series analysis is a very interesting topic in its own right, but I'm not covering it in these lectures; we can talk about it separately if you wish. For the time being, I would like to carry on with regression, which is still a very useful thing. — Okay, sure. Thank you. — You're welcome. Are there any more questions? I don't see anything in the chat, so I assume everything was relatively clear. So what I want to talk about now — and this might reasonably be the final topic of this mini course — is something called Gaussian processes, which are a way to avoid the pathologies we've seen in the case of basis function regression. Basis function regression gave us the ability to model more complex relationships, not just linear ones, but it came at the cost that the choice of basis functions essentially determined some of the properties of the functions your random model produces. Importantly, though, it showed that probability distributions over functions exist. Now, there is a formal mathematical way to go about this: consider taking an infinite collection of basis functions. You see, all the problems we were having — predictions with absolute certainty far from the data — were linked to the fact that the set of basis functions had to be chosen finite, and that caused the pathologies. If we were able to take an infinite set of these functions, spanning the whole space, then your covariance would still be determined by the choice of basis functions, but your variance would no longer collapse: if your set of φ's does not die out at infinity, because you keep having φ's everywhere, the variance far from the data would not be zero; it would be whatever the sum of squares of the functions is there. And that is the idea behind Gaussian processes: essentially, they are regression with an infinite collection of basis functions. The reference for this is the book by Rasmussen and Williams, Gaussian Processes for Machine Learning, chapter two.
Chapter two — I very much recommend reading it; it also gives a very good description of basis function regression. The point of view of an infinite collection of basis functions is what people call the weight space view, and here I'm following the terminology used in that book. I'm going to concentrate on what people call the function space view, which is the one we will use. The function space view of Gaussian processes is a more axiomatic view, and here is the definition. A Gaussian process is an infinite collection of random variables f(x), indexed by an index x living in ℝᵈ, such that for every finite collection x₁, ..., x_n, the vector obtained by evaluating the process,

(f(x₁), ..., f(x_n)),

is Gaussian distributed. We write f ~ GP(μ, K): if you take a finite sub-index set from your infinite index set, then the vector above is a multivariate Gaussian N(μ, K), with mean vector given by evaluating the mean function on the collection of points, μ_i = μ(x_i), and covariance matrix K with entries given by evaluating a function of two arguments on pairs of input points, K_ij = K(x_i, x_j). This is a fairly abstract definition. What it says is that a GP is a stochastic process — an infinite collection of random variables indexed by whatever variable we call the input — such that any finite-dimensional marginal is a multivariate Gaussian. Why do we call it a finite-dimensional marginal? Because you have to think of having marginalized out the value at every other x except for these n ones. So we draw a random variable which is an infinite object, with an infinite index set; we take a finite subset of that index set; and in this way we get a finite set of random variables, which happen to be jointly Gaussian. That's the definition. If they're Gaussian, they have a mean and a covariance matrix, and how do I get those? Well, there exist two functions. One is a function of one variable: μ, the mean function, goes from the input space ℝᵈ — or the index set, if you prefer — to ℝ. And K is a function of two variables: it takes pairs of inputs and returns one value, the covariance between the random variables f(x_i) and f(x_j). Now, whether this avoids the problem of basis function regression depends, of course, on what this K function is. A typical choice is a stationary Gaussian process, which means that K(x_i, x_j) is a function only of the distance between x_i and x_j. In practice this implies that the variance of f(x) is constant. Note that K is by definition already the covariance, not the second moment, so the variance is simply
K(x, x), which for a stationary process is K evaluated at distance zero, and that does not depend on x. So from this definition, if I have a stationary Gaussian process, I end up with a prior distribution over functions such that the prior variance is constant over the whole input space — which is exactly what avoids the pitfall of basis function regression. Now, you might well ask: do Gaussian processes exist? And the answer is, of course, yes. What do we need to do to get a Gaussian process? I need to create such a collection. Clearly, the choice of the mean is irrelevant: instead of taking f as my random variable, I could take f − μ, and then I would get a zero-mean Gaussian process. So what I need to worry about is the covariance. To get a valid Gaussian process, you need K(x_i, x_j) to provide a valid covariance matrix for every choice of points x₁, ..., x_n. So the question is: I've given an abstract definition — does it correspond to any real object? For it to be a real object, I need to be able to construct it, and to construct it I need to be able to evaluate this function of two variables on all pairs of input points, and the matrix that comes out needs to be a valid covariance matrix of a multivariate Gaussian. And covariance matrices of a multivariate Gaussian are not just any type of matrix: K needs to be symmetric, so K(x_i, x_j) = K(x_j, x_i), and it needs to be positive definite. There is a theorem, called Mercer's theorem, which states under which conditions a function of two arguments can be a covariance function. The practice is that most covariance functions are constructed from a few existing and well-known covariance functions. The best known of all is the so-called radial basis function (RBF) covariance. (I saw something in the chat — no? Okay.) The radial basis function covariance means that

K(x_i, x_j) = a² exp( −‖x_i − x_j‖² / (2ℓ²) ),

where a² is an amplitude parameter (it needs to be positive) and ℓ is the length scale. This is by far the most widely used covariance function, and it has two hyperparameters — the amplitude and the length scale — which characterize the type of functions you can sample from this particular Gaussian process. The sketch below shows a few such samples.
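This is a minimal sketch of the function-space view with the RBF covariance: it builds the covariance matrix of a finite-dimensional marginal on a grid, draws a few sample paths, and checks that the prior variance K(x, x) = a² is constant. The hyperparameter values are arbitrary illustrative choices.

```python
import numpy as np

# Finite-dimensional marginal of a zero-mean GP with the RBF covariance
#   k(x, x') = a^2 exp(-(x - x')^2 / (2 l^2)).
a, ell = 1.0, 1.0          # illustrative hyperparameter values

def rbf(xi, xj):
    return a ** 2 * np.exp(-np.subtract.outer(xi, xj) ** 2 / (2 * ell ** 2))

xs = np.linspace(-8.0, 8.0, 200)
K = rbf(xs, xs)            # symmetric, positive semidefinite

# Sample paths from N(0, K); the tiny diagonal "jitter" keeps the
# Cholesky factorization numerically stable.
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(xs)))
rng = np.random.default_rng(0)
samples = L @ rng.standard_normal((len(xs), 3))   # three sample paths

# Stationarity: the prior variance k(x, x) = a^2 is the same everywhere --
# no collapse far from where the data will be.
print(K[0, 0], K[100, 100])   # both equal a^2
```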
Now, we are a few minutes from the end of the lecture, but I think this is a fairly reasonable place to stop before I start showing you how you do computations with Gaussian processes and how you use them in practice. So if you have questions, this is probably the time to ask; tomorrow we'll see how Gaussian processes are used in practice, plus a few potential extensions you may want to look into. I saw there was a question here — it's probably best if you ask your question to everyone, not just to me. — You said you use this method, basis function regression, in your research, so I wondered: why not other interpolation methods? — Because I don't just want to interpolate between two points; I want something that is defined as a function, so that I can plug in any new query point and obtain a value. You could do, say, Newton interpolation, or linear interpolation, which would be like fitting a piecewise linear function. And that would also be expressible in terms of basis functions, but it would depend entirely on the data, because your basis functions would be determined by the data: linear between data points, and that's it. Instead, we want a global set of basis functions that fits the data, and it generally has far fewer parameters than interpolating between every pair of points. Any more questions? If there are no more questions, we'll stop here, and we'll pick up tomorrow, at the same time, from where we left off today.